1ii§^8il^bTai:^-rs^x->filWgi5l 8fc^«i;l 
m L T * 1 46 T X - >^}gfi5t-r -5 l^ffi^ X - >^ 

±iB<^ffi9" X - >©^n^nstci&wa*p(cMjs-r -s fi& 

fcttS±IB<S*f ^x->coflStt&t>*Mai±^iiJ^ L, 

itmm 1 3 3 ±imtiixmit. t -r s -t y ^ y 
htcM-r?>±fE!^®ai:±8B«ij^s?Pi:*fflt,^T. mm 

tfg^x-y^mx@i:> 

±iBp°ciS)fiij^s-^/3"«m^o H°pKSi)^a*riSffi^± a ^ <^ 

ffl^x->co*^tJb:t)-r-57-i'.'l/i5f U >'yx^ei:^Wr 
■r ^Mia^ X - ^^^to ^ X - ><^tt3xe t > 
1 5 3 ±iimmj:mic ^ k> . ±Mmu=f- x - 

c 1 6 3 ±iB»«f xatc d; t) . ±.tmm^ x - 

CifstcJSi 83 ±iEm^f^^-ytLx. yu^*!^^ 
CM^^ 1 9 3 ±immxmic ^ »? . ±iB!isfaf^ x - 



r 1' • 
^tcsii*^^^-r5<i4:^!|f^4:-rsii5RjR2 oie^ow; 

[|«5R«2 3] ±IBiiitt^x->«^ ^K^ISfK^x- 
><D«iS(i:fiJ*^;g:Wr§ci:^!t#mi:-r?)llAt<^2 Die 

2 4 ] ±IBISm^ X- a^lBitt^ X- 

mMmzdi ±iBis(t(^x->ii, a^iKfw^x- 

fW L rc Mf^fc^ S U > ^ 5Sftt^ X - set 

itmm 2 6 ] ±82!wm^ x - > a mmu^ x - 

fl^^ X — > T'$) S ii i: ^#§4 i: -r S tt^« 2 2 ieic<D^ 
Clt«:a2 7] ±IEmitt^x->t±, ^^miW^x- 



m^mzai ±8a5S<H^x-^'{i, ^^«fti^x- 



[l»*JS3 1] ±IB1^ffi^et4, ?^^4:-r§±IB3K«»i 
^J^t>VX«^f^-tr i'"^ > h tcM-r ^ ±IB1tM« 4: ±IB 

h^-g-ty<i^ffi^x->^raB#M«TLT*i6. ±iH1^ti^ 
S±IB<!iM^x-:y<D«S14&t>*B8ait*Sii^L, ±IB 

£&«iHij^«spA^msc)p°pKiflij^a-*raa{i;&±[Hi ^ 
x.-y(om^^7i^tt>^!i t^w^t^mn^^z earn 

Cli5l<«3 3] ±IBM«T¥S{i. ±IBiifW^x->% 
fflV^T, ±IBlf7'*«^<D©mWtr7=^:ti^3gt UT. e*! 

2 1 IB^cDBft^^l^MSSffio 

[If *^ 3 4 ] ±tEmm^mit. ±mmM=^ x - >^ 

■r§w*JS2 1 iB®(DB!ii;{tg^Maafio 
ci«*« 3 5 ] ±iB)i?#f ^st±, ±tmm^ ^ ^ - > i: 

S (1 1 i: -r S If 3 4 IB«t<DR!fc«!^p&MS 
[0001] 

M^^gjg^Sefc-r -i)/^^- ^^l^tb LT^^«T-f §M^sa 



[0 0 0 2] 

[0 0 0 3] coct^ic, miio«!i^fi»i^^^«&tb-rs/-c 

^©i^tttBS^R«> ^(Ol3itA,Et\ m^iS "G. Ahanger and T.D.C. Little, A survey of technologies for parsing and indexing digital video, J. of Visual Communication and Image Representation 7:28-4, 1996"

[0 0 0 4] 

^JVTti. 3.— fli, tttti^n/iBi^^^lSjOv'g -y 

L/i^lgilBtcfcitSi^a -y ht±. 7tficr)t©;b"«^i/^i: 
?!)'«^<. il©<fc9^i^3 -y h*tttil-r§tJ^*tD^^ttttl 

/Co 

[0 0 0 5] */c, m<D^m^\^mt LT«. fi'iJ;^{i"
"A. Merlino, D. Morey and M. Maybury, Broadcast news navigation using story segmentation, Proc. of ACM Multimedia 97, 1997" -^^gaV 1 0- 1 36 2 9 7

oen^iamfi^riti. g w<D-» yMz.m LTfi^w 

fb-r -5 c: i; *^T'# i: ^ /"Co 
[0 0 0 6] ^fP»lc. ft&tDet^iftHlSiSi: LT{±. 
ffU.S. Patent #5.708,767^^^{C|g«$tlTl/''^ J: 3 

giifb$n/'ct©T't±^c<, H<Dv/a «y hA^|l|i;rtS* 

^■r t o T ^ * j^^-r § fc 46 . rL- if A A^'i;" 
•w^o* tc|5s^ 5 n -s i: I ^ o fc p.g® t o fco 

[0 0 0 7] ^etc^/c. ffi<DB!lt^fttUStBi: LTti. 
01J^{f4#M¥9-2 1 4 8 7 Q^li^fBtCieig^nTV^S 

J^^cc. ->3 >y h^mi:*il^a55>Mtbi;*m^^t)-a:§ 
Cli:(c:J;0>'3 -y h^lStSiJ-rstiOA^^So 
P., CtDt^jRO^^ttaH^tTifi, l«^g|?5^*^->a -y hJ^ 
Wlc^^/tx L /cil-&0*lcffi^^ nfc t ©T'fe o /Co 
[0 0 0 8] */c, {tiKDRJIi^ttmj^liRiiLTti. ^ij^tf
"H. Aoki, S. Shimotsuji and O. Hori, A shot classification method to select effective key-frames for video browsing, IPSJ Human Interface SIG Notes, 7:43-50, 1996" -^IfBfWQ-e 3 5 8 ^Wi^mttm.

fi^igii-r-5fcfe{c. w§.z^fdmki^ 3 -J Y^m^t 



[0 0 0 9] $e.{Cx c:n6©j;9^®!;#ttaiSiri«s 

6^ Id L Tb^^tH-r § ii i: *'5T' # o fco 

[0 0 1 0] *^Wtt> ii(DJ;-5*llit{cffi*T;5:*n 

/cttOTfeO. ±iEL/ci^*oe!lt5«ittaiSWOP4M^/5? 

« -r i. £1 i: >gr a 6^ f S t O T' 5 o 
[0 0 1 1 ] 

[»ii^^j*-r§/'ci6<D^ig] ±asLfcaci*)^afi5ft* 
COO 1 2] 

coo 1 3] %r!L. ±aiLrcgfi^i&^B5c^S*fg^{cA> 
coo 1 4] c:<D<t9'5:*flB^{C*^7!)>i)^li^©pSfi!iS^ 

So 

CO 0 1 5] 

So 

coo 1 6] *58H^^jafflLfcllSfi<Dm^«, ^iii^n 



coo 1 7] *f8W{ctev^T?*^i:-r§trx4-f--^fc 
y;^^!-, ^Sft(^x->-i:v>31t3^^W-r?)toi:-r 



5o fcTT^^j-T'-^'fi. 5(/>(ci$^<oa3^©5iim 

coo 1 83 CCDtfT^si-T^— ^ft±, Bft®&D*gF^<D^?3 
coo 1 93 -typOhti^ cD*;^^{i:<t D 

yp<>hti> "D. Kimber and L. Wilcox, Acoustic Segmentation for Audio Browsers, Xerox PARC Technical Report" t^:|B«^^nTV^SJ;•5^c^ m^i£. mP. S

$6>{C. "SP-t^y-yhl/i. "S. Pfeiffer, S. Fischer and E. Wolfgang, Automatic Audio Content Analysis, Proceeding of ACM Multimedia 96, Nov. 1996, pp. 21-30" {ciB«5nTi/^5J:5(c, z^comm-t^'Spy

CO 0 2 03 t:<DJ;-5*lf'r:i-f'— iS?{c«3VT3S{a^x 

m^aL-yf)^^iS-t<fytyh:^S i ,. • • S i ^T' 
mLfct^. ^TcD-b^'pOhtcMLT j = 1 . • • 
•. k-1 : i j< i j^^ti^l=S.>^iL-0-m<0-t^yiyhX 

COO 2 13 ^{tt^x-V^ffll/^-SuiifCcfc^T. ki"7=" 




[0 0 2 2] ctOJ;9*^ffi(^x->'i: LTti. tt{i:i¥ 

^m^x-y. jsww^x-^Tb^feo. iinp>{±, 
[0 0 2 3] ccT\ s*iia(Wf^x->i:t±. ^isa* 

ifjV-Zfift <5 rclsb<D ifjl- 1? > y Tyl/ U X A 

Xti 7 X i5r >; > ^^'7 'J XA^ffl V >Ttf C i: Tb^T 

*/-c, g y^isffit^x-^iiti. ^©f-x->i^ 

[0 0 2 4] ^LT. e:£DJ;^^ilS^^^x->'^±. 

[00 2 5] CilT% >'->i:ti. i^y'^v'-^^. ^ 
h f^cD^MffiT ^ 7- ^ l£7-'i'ai:V''o^c-tryp<> hOlt 

[0 0 2 6] ^T. ±KEL/t^mfi^iff'^ii3g^tttB-r 

§3K^W^x->'<Dft{*fi?iJi:LT. 0 2tc^-rJ:'5{i:. 2 
nut. — »fC> <l<DJ:3^^M-rS©m^x-V{±. Rl 

a-r-s efeis-tr h <o ^^vi/- ^ti >^«iai-r 

[0 0 2 7] ^/i. ±aiLrc:*:^fl^ifx:t1gjg^ttai-r 

(BlIltcfcV-'T, Htf-y^A. B. CD. • • • ilVo 

[0028] :^mBn^mm LrcmM(Dmmt Lrm 4 
1±^S'J^L. ±aiLfcis{w^x-y*iW)e^{ci^tb-rs 

0«. !liitt^x-:yigr8?«T-r-5c:i:tcJ:oT. li-r^i-r 

T^^i-^iST'feS h try ^llOigJU^/KDliji^Jfttb • w 

i|S^-rsci:*«T'^So 

[0 0 2 9] ^^gfSsas^B 1 oti. iRl^tcS^-r J;3 

ita^tu 1 St. m^-ti^yiyhRif^P-t'ifpiyh 

X - ^'{c ^ i: 46 5 ^tb¥ST'* 5 ^ X - ^^l^aigp 1 
6 i:, 2 ocD-b^"^ > hKOilSittlt^iiJ^t-SSiCtl^jfl'J 

aai^i^ai-rsp«T¥®T'feS5=-x->'»iffa5 1 s 

[0 0 3 0] If f=";i-^fijg|5 1 1 0IJ;^(f, M P E G 1 
(Moving Picture Experts Group phase 1) •^M P E G 
2 (Moving Picture Experts Group phase 2) , gKV'>ti 
VtJf^SDV (Digital Video) cD<t ^ ^H^Se-r^tT^ 




X- *^^#^f i> C i: < jtS^Saa-r S C i: *^T' * 
[0 0 3 1] lfx:i-{ry^>h;><tU 1 2{±. ^^t":^^ 

c 0 0 3 2 ] mmwi^mmiii^ 1 3 ti. e^^^-^tijgp 1 
mcDwmm^mtii-r^o Wimnmrnmia^i 3ii. jess 

CO 0 3 3] «^!|#MfitttilgP 1 4 {±. Vf T'^i-^i-fijgp 1 

«o#^s*ttm-rso ^^!t#«s«»tHgp 1 4 it. s.m 
^Pt^- ^ ^^±i^m-t c t ^ < mmmm-t ^ctti^ 

[0 0 3 4] -tri^^^hitgitap^t'j 1 5ti. mi^nwL 
sttffigpi 3 SD*^^#^M«inigp 1 4*>p.^n^nm 

[0 0 3 5] ^x->1^iaigPl 6{i. trv';^--b^^>h 
V^<o dO^x-V^btiaJl 6{i, ^x-XD^aS^ilg^ 

^Si:46fci^. ^2©:7^';^^5^'J>^•"SPg;&ffl^/^T^x 
— ^OSSI-b-y h^^^-rSo ^LT, ^x— V^ttjgp 
1 6t±, 1^tbL/-c^x->^t^g(D^x->^«faPl 8 



CO 0 3 6] !t#@l*ISmi4j9J^gP 1 7 ti. 2 00-t^V 
CO 0 3 7] ^x->'M«fgPl 8(i, f-x-V^mgPl 

6{cj:o^ffl^nfc^x->«jg^8?«TL. m^<omm 
x->')^a5i 8 a. mmt^^^ic. ^om^^n^ 
[0 0 3 8] <i<o^orj:f^m^p^m^mi on. mvA 

CO 0 3 9] ^-r, efe^^i^ffis^s 1 oi±, iBiiatc^^ 
■r^tJ^. BftfiK^^sjas^B 1 ot±, t:x';*-5i-iija5 1 1 



1 on. "G. Ahanger and T.D.C. Little, A survey of technologies for parsing and indexing digital video, J. of Visual Communication and Image Representation 7:28-4, 1996" (C|Eic:SnTt/-'S J;-5*:?3

CO 0 4 0] M^/^T> ^m^PMm^mi on. xt--^ 



Mmms 1 0 n. mm^mm^m^f^ i 3 



m^MiW-f^o ^m^^^m^mi oic^i^^xn. m 



CO 0 4 1] «eV>T. 



^mTr>o ■r^t>t>. mm'^pfi&mmmi on. ^mm 

CO 0 4 2] iggV>T^ Bft^^l^jaa^H 1 0«, XT-«y 
CO 0 4 33 *LT, fim^mm^m. l on. Xf-y 
[0 0 4 4] ii<D.i:^^-]ltDMS^MSCi:{cJ;o 



S jHi' > h tcffljStc 7 1> -tr X L /c •? -r S C i: Rjftg i: * 

CO 0 4 5] j^T, i^iafcs^urceft^SfS^aa^B i o 

CO 0 4 63 *-r. HcfettSlf-r;i-5i-fiJtc 

CO 0 4 73 o^'tc, X7--y:/s 2tc*3ij-;§)itiiflad3 

^^asB 1 0 s^^itiiaftaigp i 3 ^^^^nwim. 

So B^#^^50,S^Bl Ofc*3V>T®fflort6i:'S:'5c:n 

mm 1 0 *^^^{t:<o/ci6{c!^m«am t ±ai l /c if f-"^ 
CO 0 4 83 itmai: bT{4> ^-r^m^itcM-rs 



T'feSo c<Diii:*^c,. eiif^#m«tt, w»^^«aa^ 

Bl OT*fflV^Sc:i:A^T't-S«S*#^«<0 10T'fe 

a-iir^i/^A^ w^^^saa^Bi oa. ^jiS-rSJ:^* 

v^m-tr tDKM^#®%atb-r S c i: t T- t So 

[0 0 4 93 mm^p^mmm i otc^v^T. B^^^tcfc 

mmL^nm-t^Ct\,t. mTHi "C. Ahanger and T. 
D.C. Little, A survey of technologies for parsing 
and indexing digital video, J. of Visual Communica 
tion and Image Representation 7:28-4, 1996" (clgit 
$nTl/-»SJ:-5tc. J:<aie>nTi/->So CCT\ 

atent #5.708.767-^^ffi{C|e®$nrV'>Scfc -ptc, 

-txhif^A^. ffi^x-^A^6itg^ifttB-rSc: t*^ 

T^So 

[0 0 5 03 mm.^'fiM^mi ox\,t. "t^^^yv^ 

1llfi!c-rsei^«!{ctiltSt^t4:OYU vfe^ra^r, fe^ 

■v:/^^;l'^/ct) 2 If y h-e-t)->7'-'H-Tlifi5tLfc, 2 

2 3= 6 4 :^7t;<0 1 X h Jf=>U^^ h;l/*fl9V''TV^So 

[00 5 13 COi^^tlX h^*^At±. B5^^£0±i*W 
lS.^mn:mtti\ iintcti^F^1f$B*^-&$nTV>;5:v>o 

Sfil o^i:t5^^•s^x->^tb^ct5^^T. ^I^ScomiW-tr 
X - ^liii T S c i: ^^-r W:^ ^ti^ i: ^ S o 

^^nmmm-t^t^icii. isjsi^DfitBtci^So c<d 
^iff^Srii-^ii^cSiiii^mbbL/cctA^p,, 

S^BlOTti. 7t;OI!fe^^MxN<D:*:t$(DiJ'"WX 
':r-;l/Blfe^'\K?|t^/jNL> (in>g:ffl(.^T^«iffiMi&ti- 

8¥«R^nSo 

[0 0 5 23 ^^{c±SEL/-c©l^!|tm«i:(i:S^S!t#M 



CO 0 5 3] ^-r, ^^s^iasSB 1 0 ^-ux 
§o B^^^^saasgi Otis e<J;^{f. locD^^-fe^* 

> h tc fc 5 « lgS(WIB<05J-^P^a-r /c 4<){C, F F T 
(Fast Fourier Transform ; iSii:?— "J x^g|) fS.^. 

[0 0 5 4] S/c, B!li«^^^laSSK 1 Oti, ^l^fcT-y 

[0 0 5 5] tibic. v^^^mm^mi oii. 't-zt. 

2 ^JlKWr^W^ t ^^*> F F T X'^ h yl'Xti L P C 
(Linear Predictive Coding : i^Jg^lgiM^^t) ^A'"?) 

i 'i^f,^ + l) 
^~ f-b 

[0 0 6 0] (1) fctiv^T, btfti, ^n^ti> 
#^T'fe§o sfeflSi^^ias^B 1 oa. mf^Wfcti. p 

[006 1] i:c6T% ±aiL/c:eft^!^®«^ji&46i:-r 
CO 0 6 2] B5S!«#l&SaaSB 1 Oti. ^iJx.tflll6JC^ 

CO 0 6 3] tc^T*, m.^^y:f)\^i)'^n\z.m^(om 
[0 0 5 6] ^etCffiCDl^iaSi: LTti, 

CO 0 5 7] T^-f-i' tTT^-c -tr^*^ > hOrt^Tb^ 
t*©eS»lfi^sKi'HifilMT'feS i ^ tc^i; A^^a 

[00 5 8] CtDT^'x-f lfx-<{i. t:Xh^*5 

-i:. i j i:oMTSij^^n/-ciitii[aFtc^^-r 

i>^^S{wttsij^a**dF (i. j) i:s«-r-i)t. 

«7i?f--i' tfT^-r Vpii, (1) ©.t^tc^e^n 

So 

CO 0 5 9] 

C^l] 

... (1) 

5J^-&^#;^5o CKDJt-a-x ll:7L/— A'N^fb (fade) 

LTv^<tt^«D2ocD-typ<>hicov^T(±. -y-^:/;!/ 

CO 0 6 4] ^-clT. Rft^S^Saa^B 1 0 C(DJ: 

c:THi> -fise^^^^ii[«co+>-i/yj>'y73?4*2-p©il 

-r^t)-^. ( 1 ) It^Sfi^HISOni^Tc'^^ 
LTS-rci:A^T'#^«^i:, (2) SPISiWttiB'J^SJP 

(1) (cti. tx hd'"7A-^/^y-X'^^ «t 

So 

[0 0 6 5] (1) tcfci.^Tt±. ^yy)\^mt. 

k ti*46?>nT4of:)^ R*^e^5aa^Bi ot±. "L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John-Wiley and sons, 1990" tCgEic^ nT.i: < ^e>nTl/^S k^i^fii




^^7.^V>ifik (k-means clustering method) 

ytDM'Wa (centroid) Xt±C<oa/L,x{8(CjfiV>'+J- 

[0 0 6 6] -73, (2) {Ct3l/^T(i. R^i^^pg^ttS^ 
Bl 0\ii^ "L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John-Wiley and sons, 1990" IClBtE^nTVS k -
K7';l/rfUXAS (k-medoids algorithm method) ^rffliz-'T. kffl(Dy;l/-:/^Jgfig-rSo ^LT, «t

-:/«(C. ±aiL/ci'Vl/-:/©^ K-Y K (medoid) ^ffl 
[0 0 6 7] ^fc. Bft^^l&Saa^H 1 Otctel/^Tti. 

So 

[0 0 6 8] CCDJ;31CLT, B^^^F^^aS^H 1 0 



[0069] U±<Ocfc 3 tc, K^^p^MS^B 1 0 (i, 
+^T'^Sc:i:*^^v>o -^-CT', Bl^^^l^Sas^B 1 0 

[0 0 7 0] ortc, 0 5 4';^x-y:/s 3tc*5tt-5#ia 



1 Ot±. 2 0(Dl^SSfi{<:ot/^T, 



> h s s 2.<^tmmi^nw.-t ^mi&^^mmL 

ti, j.;<ToiC (2) T'#x.^n5M^^sis-rSo 

[0 0 7 1 ] 
[IS2] 



• t ■ 



(2) 



[0 0 7 2] i:C6T% ^P!KfWttS'J^fi^Pocf{c(i> 

hanger and T.D.C. Little, A survey of technologies for parsing and indexing digital video, J. of Visual Communication and Image Representation 7:28-4, 1996" ^ "L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John-Wiley and sons, 1990" {CSBtE^nTVSJ; -5



[0 0 7 4] CCT\ T^^3fc^ i n^krvTs^ V )V 

[0 0 7 5] s/c, ©t^^i^fias^ai o(±. ±aiL/c 



KSglii, rtiW, L lgg«t^T'feS„ CCT, IttCL 1B§ 



SBlOti, L \^m^mXt^o CCX. 2OC0n^^ 
TC-^^' h;b^A, Bi:Ufc^-&> A, B^OL18gS6d 
Li (A, B) (3) T'^^e>nSo 

[0 0 7 3] 

[^3] 



... (3) 



S Fi, S F zOKOl^SteittSiJ^aiPfi. (4) CD 

[0 0 7 6] 

[5!&4] 



[0 0 7 7] CCT, ±^ (4) fC*5tt5MS[d 

F (F,. Fg) {±, ^oa^t^s^fi^itmaFicot/^ 

[0 0 7 8] i:C6T% efei^g^SQSSB 1 0 {i, -ty 
SHlOli. k{a(D!K!P?gl[«Fi. Fg. • • F^*^ 



[0 0 7 9] 
[lis 5] 



(5) 



[00 8 0] CCT, {w,} ti, z,w,= ii:^§a 

[0 0 8 1 ] \:a±.(d^ o {c. B^ft^^^as^a i o 

[0 0 8 2] O^tc, ig54'XT-"y:/S 4{c*3tJ§f'x 

[0 0 8 3] i:c:6T% UiT{c^ll^n§^<tt^x-> 

^^^l&^aaSB 1 0 tC*5V^T«^ 1 OCD^^x — ^^^IS 
iil<D^5J^:/{C«-rSili:A<oJtgT'feSo CCTti, CCD 

[0 0 8 4] ^T, mm^:iL~y<D^-(yii. mmmn 

^<Dt. ^i^iim^x-y<o«jfi(c$i|^^Wr*fe<Oi: 
-a<D-feyy>h S i J. Si^^afC 



k = 1 . • • • . m- 1 {COV^T i k< i k^iX'Sb?>c iE 
P>{C, I C I {±. f=-x— XOfi^^^L. Cstartj^uf 
Cend{i^ ^n^'n> e7':t-r'— :5'tCtett§^x — >C 

^x— VCcoPii&^SiJfi. ^x— ^Cfc^jltSg*^ 
^x->'C^ct5^t§«^tcD■t:y^>'h(Di^T^^J■Z? 



If* 



<DmilA-t->'?iyh^. A' , A' • . A 
a r (Sj. S2) X-mto 

[0 0 8 5] ^m.mU9-:si-ytl^^tsmn-irif:>iyh 

(ommcmm^^-t?>miu^3L-ytLxit. m^mn 

[0 0 8 6] *-r, m^mm^jL-yx^^i}\ cn 

a. mi ic^.ti:^ic. ^xcD-t^^yhtjm^^icmn 

mi^-t^":^ y h^-^ii-zfit-r^rcitxD^^iy-iiy^T 

yl/ rf U X A X« ^ X ^ 'J > if 7 ;l/ ri" 'J X A ©*a* i: L 

[0 0 8 7] — 73> U >^SRfa^x — El 8 fc^ 
-r-fc^t, |^ig-r5-t:y;<>'h*'«SV^(<:5g{WL/c5^x- 
>CT$>§o -r^tJ^^ U >i'®{a^x->'T{i, ifeT 
(Ok=l, ICl-1 (COVT, s i m i la 

r (Sk. Sk+i) T-fe^o CCD'J y^JiM^x — 

±a?Lfcji(K-try^>'ho^«*^e, A' , A* • . 
A' ' ■ . • • • tmmt^ctif^x^^o 
[0 0 8 8] ^etc. mmw}^3L—ytit. m9ic7jkt 

i:^fttL/-c^x->C<-yciic'^?fe^o -r^t*-^, ^WW 
?-x-i^T'ti. ifeTiOk= 1 . ICcycncl- 
1 tCOV'-T, s i m i 1 a r (Sk, Sk+j) T'feSo ^ 



fi^^x— >{4. Si. Sg. 



• • • 



• * 



. s 



» t t 



[0 0 8 9] 1Sjifl^SiJ*^*W-r§«fW^x->i: 
LTti, ^m^x->. i^— ^x— >*^feSo 

[0 0 9 0] CCT% ^m^x — ^tti. ±aiL/cJ;^ 




So f'&t)*., ^m^x-yT'ti, f-x— >F«9<0 2C>(D 
a pta-rt. ±T<Ok= 1 , ICI-HCO 

i k^g a pT'fe-So 
[009 1] $/c, f-x-^rtO-b^^V^hA^JSJ^IfL 

1/ ^^KP^PBT'^n-sJi^, c n«as^ t:x3^^Ji^D^ 



x—yt'&m-t^o <1CX\ ^x— >C©1^— 14u n i 
f o r m i t y (C) (6) »c^^J:^tc, II 

[0 0 9 23 



imiformit)^C) = 





( Q it art Qjturl\ 








1 


c 




\c\-\ 









(6) 



CO 0 9 3] ±^ (6) f^^n^^x-VCOl^-14 
u n i f o r m i t y (C) tt. 0*^6 1 tOffiH^Offi^ 

^^fli(cjfiv>c:i:^^-ro COifel— ttu n i f o 
rmi ty (C) OffiA^m^Oi^— ttrsailJ; 0 fc/J^^I/> 

^x — yc^i^— ^x— >i:*t^-ro 
[0 0 9 4] J-:iT> RJ^^^^sasSBi otcfei^T, c: 

Oct 3 ^^ffl^x— xD^-n^'n^i^m-r src46<DMs 
[0 0 9 5] efe^^^iassH 1 o«, ±aiLfea*3i 

10 0 9 61 /'i^y^i'^ 7,^ Vy-iflAWi tit. ^x-v 

i^eti^x-:y^ai^si^p^x «^fl-rn{f, e-r^t-r-^? 

^«JiiXt±IBSi-r'5i:|llNfli:^x->*^tbLTV>< C 

[0 0 9 7] W^^f^^aaSHl Ofi, ^^y^f-^'yT.^ 
U>'y^ffi^fflV^5^-&{c{i, 0 1 OtcS^-TJcdfc. 2 

[0 0 9 8] ^-r, «ii^^^jassBi oa. xx^t" 



[0 0 9 9] B5(!#©^^!aa^H 1 Oti. ^fttf^x — 

S C i: A^T' t ^ "L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John-Wiley and sons, 1990"
fEic^nrv^-SPg^fl^^^X^ U (hierarchical clustering method) ^fflV^S d i: tC-^i)o CKDT';!'
dfUXAti. «t^iaLit2oc0-ti^p(y 1 

(Cj, Cg) ^^ (7) tc^-rj:3tc. ^n^no:> 
tLT^^-rso 

[0 10 0] 
[^7] 

(7) 



[0101] ^tJ. etfife^^^fflS^Bl 0{C*it/>T«. 

^^:^s^c^c:^l:T. ±^ (7) TvTN^n§«/j>ii8^o{-^^') 

[0 10 2] i:t:6T\ dO^aW^^X^ U y^'^ 

3o •?-iiT% BftfiR^l^fiaS^S 1 Ot±. 01 nc^f.}; 

:><>hi:!Kf£i-Z:-feS*^S7t)^^fiJ»fr§o CdT, ^fiSfU 
14IB{i5sin.i:f*> l^0tc^-rJ:9tc. Z'0(D^^:^y 
VIS^E (ommMHA LTV^?,^^^i:|il-<D^x->^^:Jl-r 



» 



CO 1 0 3] j&fe. m^^P^ss^m^mi on. $mfaik 

^ n s ^ X i&(±'>^ < jS: 0 . $mimmm 6 3 , 
CO 1 0 4] cti^o. m^^p^mi^mi ofcfca^r 

Wic j^^-r § c: t -e # <!) o 

CO 1 0 5] mX\t. =t(D 1 0(D75-?St LT, 
fiaS^BlOtt. (n) (n- 1) /2<@cD-try;)<>h 

*fji) tyi^^rzwmm^m^^r. ^mmmms ^^^^ 

^*tx, 0. 5&O'0. 1 icia^-ri.c:i:*'«aiif^c$S*^ 

CO 1 0 6] llffl±{c*5l/>T{4. B5Sl«^|SjiaS^S 1 0 

14^3}?a6n{f <fcvv 8l^«^f^«a31^H 1 0«> CKOJ;^ 

CO 1 0 7] ^m^'^^m^si 0(4, cnsT-ic^L 



CO 1 0 8] fcil^T, 0 1 0'P7.7-^y-/S 1 1 lCt5l/'> 

i6, 5^^^^5aa^«l 0(4. X-r-y:/S 1 2fCfel/-' 
■5, e«i^^^5as^a 1 0 14, e-r;t«ji»#T(i:t>ttS 

^x-><^fficD«Bttsa'Miai4^si)^L, m^©p°pK 
S'j^aipi8{ai&±iii s ^ X - X- 



^nsMjl14jfiiJ^MiiJi:LTe*>*Sfi*0>Jf4, ^x-> 

CO 1 0 9] iiC^T, BftMfe^^Ma^Hl 0(i:*3t/''T 
(4, ^x->a°nKiS'J^»^i:LT, ^x->S, ^x- 

CO 1 10] s-r, ^x— >ST'^§*\ c:nt4, 10 



{4, — ^^i:^x-ys*vJ^^v^^-a■T■$.f3, •^n{4a^ 



^, *nt4{5j6oifffi^WLTi/^^t/>o -r^^^-fe, ^x 



CO 1 1 1] :^^{c, ^x— >2BJST'feS*^, cn«, WL 
55^x->*WW-r§±-b^V>'h8S[i;, ^O^x-> 



r 1=1 s 



s 1 0 14, c 



CO 1 1 2] at^fc, ^x-:y3S]ST'feS3!)^ c:n{4. 



->3S«%iBij^-r 'S:3^ffiicoi/>Tt4, j-xTtc^-r^x- 



CO 1 13] -^iJicLT, W 



1 Ot}\ =f- 




■^^Tnto ccx\ f-x-^'rtMmiijffl^&tti. ^x 

•typt^ hOffiJi: LTt±, ^x— XOM/d (centroid) 



.L>-t ify^y'h^S cent ro I d ^ <^ tOM>u^-b if^iy 

H S^entrold**- C8) T*««?n5o 

[0 1 1 43 



argmm 



(8) 



[0 115] CCT% ±S (8) JC*3tt«.a r gm i n 

[0 116] iin«tO. f-X->3SjS[^dce„troldi^"^ 



'etntnful 



CO 1 1 8] ^T, ^m=spmmmmi ot±. ±a?L/c: 



[0 1 19] ^-r, B5^ftW^MS^Bl Ot±. XT-'yf 
S2 nct3V>T. ^x— >'JXhC,,st^li^^x— > 

[0 12 0] i^V^T> ^^^^MS^Hl X7^-y 
T'S 2 2{CfcV>T, ^x->"JXhC,,3t*^^t^^T-* 

CO 1 2 1] CdT, ^x->UXhC,,st*^^^KST' 
CO 12 2] -75", ^x->"JXhC,,st*^^4^*'«?'5c 

v^^^tcti, t^m^^n&m^mi ot±, xx-y^s 2 3 

I^OS^tL, ^x— :yC^^x->'UXhCi,st*"'^ 

[0 12 3] m^^r. mm^pmm&mi ot±. x-r-y 

CO 1 2 4] ^LT, ?!fe«f!§f^i!!lS^H 1 Oli. Xx-y 

[0 12 5] ccr', ^x->'p^MS'J^aqs*"«n°p«i'J^ 



^^«iasg 1 0 



[0 12 6] -73. ^x->n^ss'j^a^3!)"«p°pKS'j^a 



CO 1 1 7] 

CSJ9] 



centro 



iji. (9) <oJ; 



» t » 



(9) 



[0 12 7] -eur. Blfeia^p^MS^B 1 Oti. Xr-^y 
:/S 2 7tc:43V^T. ^x->"JXhC,,st*^^^'c®T'^ 

CO 12 8] C<IT% ^x->UXhC,,st7b^^4'cftgT 

^-^/t^^tcw. mm^wmmmi o«, ^^t^t-r?. 

<g1S^x-y3!)^#ffiL^v^c:i:A^e>. — a<05as^*l7 

COl 2 9] — ^x— >'JXhC,,st*^^*'c^'C'^ 

i^^^fcii. mm'Spm^mi o^i. xx-y^s 2 3 

Hi Oti. ^x->UXhC,,stA^^mii:^i)S-(:-Sa 
[0 13 0] CCOct^^— a<D5aSti:<i;oT, 

^assBioti. ^x— if(o 

^x->A\ e7-':tli3i(D#tS^^-r«S^^x->T' 

fe«.7!)\ lKi/H±, tf:T-*;i-«3g{cB8a-r5^x->T'fe?) 

CO 1 3 1] J-X±OJ:9ti:, %^#^g^laS^B 1 Ot±. 

fK^x - ^^^llil-r S c i: Tb^T' # So 
[0 13 2] i:<l5T\ B^^ft^pSMSSB 1 Ott, /^y 

tc?,, >'^'y^^'^X;?U>^S*Rt|5|«ltc, ^x- 



>^'S^I5i:|Sl^<D^x->p°nKffli)^»^P^ffli/^T, 1^ 



I 



CO 1 3 3] m^-^^^^vyf^mic^^^r 

h <0*^#ty|»rfc * V X ^^fig-r § ^^^^mMtcifiJ 

(0i^r>^TJl:dVXl.l,c-O\,^rii. "J. Roure and L. Talavera, Robust incremental clustering with bad instance orderings: a new strategy, In Proceedings of the Sixth Iberoamerican Conference on Artificial Intelligence, IBERAMIA-98, pages 136-147, Lisbon, Portugal, Helder Coelho ed., LNAI vol. 1484, Springer Verlag, 1998" (Dm^^^mt^C t-h^'X'^^o

CO 1 3 4] f^m^pmm^mi oit. m^^'^t.^v 



nun 



SC 



CO 1 4 2] ±1^ ( 1 0) tCfcl/^T, dsc (C, S) 

^ab. (11) x^^^n^ho 



CO 1 4 4] cnti. >'^y^^^X:Jf'Ji^^'*fi^t*l5tc*5t^ 

x^mi^rzmMm^mmx^^±^ (?) »c*31/>t. 



m in 



s,) micd^t„tLxm-r<itft^^ 

CO 1 4 5] ij)cic, m^si^pmmmmi oit. xx>yy 

CO 1 4 6] cdT'. m^Nmimd^^^mmimm 



CO 1 3 53 ^-r, m^^pmmMi on. i^atc^ 

-rJc^tC, X-r-y^S 3 1 fCt5l.''T> ^x— VUXhC 
,,st*^4^^*^*3W<tl^> Xf-'V^S 3 2{ct3l/^T, -tr 

> hm^ i ^ 1 fctS^f 5o 
CO 1 3 6] i^tc. ^^^PffiaSgl 0{±. Xx<yy 

CO 1 3 7] c:CT% i tim-t-:fyi>\' 

CO 1 3 8] -7^. -t^^yhS^ i m-tyyVhSa 
ncfcOt/h^(/^®-&tct±, efe^^F^MSSB 1 0 (i. X 
7^>y7°S 3 4{C*5V''T. -b^/y^'hS,. f^^-^CC 
T'«-tr^pt>hSi^^03i*. X-r-yT'S 3 SJCfcl,^ 

CO 13 9] ccx\ ^x->yxhc,,3r*^^m^-?? 

CO 1 4 0] —73s ^x->UXhC,,st*''?S4*C^T':& 

i/^^-a-tctis ^B'^pmm^m I on. X7"<yys 3 e 

^s*: (10) <0J:dti:^«5nSo 

CO 1 4 1 ] 

Cifel 0] 



(10) 



CO 1 43] 

HSl 1] 



(11) 



X— ^Cnew^^fiStl^- X'r-y:rs 4 3tctiV'«T, Ifrfc 

X-r>y:rs 3 QOMS'viiJ^f-rSo 
CO 1 4 7] -73, «/jN|fMfUttd„,„*^5filKM't4Biffl 
5 sin.<^ f5 t>/h?v^«^Jc{i. Bft^^l^^aaSK 1 0 
tis XT--y7°S 3 8tc*sv>T, ^x->C„j„{£:^M-tr 

1 o(±, c„,„^c„,nUs,i:-ri.o 

CO 1 4 8] ^LX. ei^«^^Ma^fil Olt. X-7^<y 



filtered*^ ii*n-r5o 



CO 1 4 93 5e>{Cx w^^Pffiasiii o{±. 
4 otcfev^T, i^'^Mfc^x->^^wr§o -r^ 

'J >'i'*$tlfc^X->UX h Cfjueped^^^t'^'^- 
CO 1 5 0] ^-LT, ©i^^^^aS^Sl oa. Xx-y 

:/S 4 1 {Cfcl^T. i tC 1 i&jljqgEL, X 

7^-y:/S 3 3^D^aa'^i:^^f■r§o 
CO 1 5 1 3 CCO«fcd{i:LT. Rit^^^^mSSB 1 0 

CO 1 5 23 ^*3, (Biiiifc^-r— xti-sti 

•tr y ^ > h 3S n HU t T # ;^ e> n T V ^ ^ 1/ ^^-S- 1, ^ 
(.^o 3S:^>j^^X^U>y7';brruXAJi. |b| 

CO 1 5 33 jiOSaStCctoT. ^{i^p^ 

CO 1 5 43 offtc. ±itiLrzvy-^mn^x.-y^^ 



argmin 
C7 = ^ 

nun ^^^tkt 

CO 1 6 03 ±^ (1 2) tC*il>T> dsc (C. S) 
{±. C <D|f ^HfW-ttiliJ^S^P d sc (C. S) It. Cl 



CO 1 6 23 -r^fe-^, nmm'\immmm d sc (c. 
s) (i, a*^^w^x->'<D^^ai<Dl^{i:fflv^fc^fiS^tl14 

CO 1 6 33 B*«^Pj!aaSai Oti. T.^rylf 

CO 1 6 43 CCT% «/Mf«im4d„,„3!)<^N^fai4Ba 



*5ttSU>i?ISffil^x-y©1^ltll«, **:Sftl^x-:/ 
ffll/^fcU^^aitt^x-^l^tbTJiStLT, 01 4{C^ 

CO 1 5 53 m.mm^'sm^mi oti. iHiiiitc^-rj:^ 

{C, Xx -y :/S 5 nc tiV^T, ^x— >"JXhC,,stig: 

hs^ i ^ 1 {cis^-rso 

CO 1 5 63 R^^g^^aSSBi 0«, 7.7- -j-f 
CO 1 5 73 CdT', -ty^VhS^ i y h 

CO 1 5 83 -73, -ty/VKS^ i 7!)<!^-by^>h3& 

•r-yT'S 5 4tc*5i,'>T, ■try;><>hS,, -rift*3-5Cil 

T, ■fey^>hSiJcWt--g>lfa{«'l43!j^g/hTfe55^x 
— >C„,„^5R46^o CcIT\ ^x->C„,„t±, i^iS: 

(1 2) <0<t9fC^g^n5o 

CO 1 5 93 

Clil 23 



(12) 



3) T*-^x.ens, 

[0161] 
iWLl 3] 



(13) 



X - > C „e^*^ X - > U X h C , , sticig/jn LT, 
CO 1 6 53 -73, «/jN^f®mi4d„,„A^lf«fOTiaffl 



Xx'y^S 5 TtctSV^T, ^x— >C„,n«D*aS(c 



SBi oa, c„,„^c„,„, s,i:-r§o 

CO 1 6 63 ^LT, BftfiK^^iaa^B 1 Oti, X-r«y 

:/s 5 8tc*iv^T, ^x->'*7^'y^^f 'j>y-rSo f 

S^^x— >'Cec,,st^^"^''''^^- ^x— ^COn'&M^ 



% 



CO 1 6 7] $e>{i:x m^sPi&mi&m I on. t^t-^j 

CO 1 6 8] ^LT. RS^^^I^iaS^B 1 Ott. Xx«y 
T'S 6 OtCfcl-'Ts •fe^^^'hS^ i tCl ;&J!|D»L, X 
•r<y7°S 5 2,(n%M^hW<ci-^^o 

CO 1 6 9] iKDi^tcUT. Bft^^PJiaSSBl 0 

^ i Xj^i^^-fe^y^hiSjn J:!? t;^#<^ofc^(D^x- 
>"JX^C,,st®^S^^x->^:. UV^'Sffil^x- 

CO 1 7 0] cc0ct3!&— aojaatcfcoT. efc^^i^ 
CO 1 7 1] ^*5. nm\z^<t-&(^%M\t. xfi^^ 

•fe y MS n Hij t -3 T # P> n T V > * I t ^ 

i/^o jSi^^7^X^'jyi^"7;Un';XAti. |H| 

CO 1 7 2] ortCs ±a5Lfc/i«^M^x->^^ttJ-r 

Sjaatcoi/^raiiW-rSo ^WW^x-:yCeyciic«^ 

k i® IDS* § s*^it(^ X - ^xt± 'J > ^ ^m^^ X - > 

^> Si. • • •. Sni:8BaiU %rcZ (S,) -tr 
^^■;^>hS,0J±li^7t;0^x->#^l. • • k% 

^•rciifc-rso c:nj;t). Ceyciic*''^wfl*i^x->' 

C (Sj) . C (Sz) . • • •, c (S 
n) ao^x->S^OMmi, i ,. • • • . 1 
k. i 1- • • • . i k. • • • . > 1. • • • . i k^^^^ 

i ,, • • • , i ySt^ ^x — 1 . • • • , kOM 

fe§)iW6^^x— > i i ,. • • i 
^x — >i:^»-r-i> C i: i:-r-i>o 
CO 1 7 3] i:c:5T% lfx^-r-;S?(C*5lt5;il 

fi*)^*>ot?fe?)fc46> w^^pSMa^ai o«. lai 5 
BlOti. i^MlJCfSUT, ^07ci:*5»2|s:«W^x— 

CO ! 7 4} *-r, eft^^^Ma^Bi ofi. rasitc^ 

■rJ;9(C, X'r'y:/S 7 1 &t>'X7^-vyS 7 2{Ct5i/-' 
T. e-r5i-7^-:Jf{C^^nsa*®W^x->'^^m 
^ntca-^i,>TfiJW5^x->'UXh^3fe4jtL. 

X — > u X h # $ :n s K«f^ X — y 
->ux h^Mff-r^o 

CO 1 7 5] -r^tj^. efe^^^jas^Bi oti, xt- 

•;/-/S 7 1 tCt3V^T> ±aiL^c»*5®{K^x->X{iU 
>^WfK^x->%«^Hd-r§7'>'Urf'JXA^ffl(/^T. m 

W^x — >U X h C , ,st***^>^o 
CO 1 7 6] ^LT. efe4S^|S*!iS^B 1 Ot±, Xx-y 
7 2{i:*3V^Ts ^ZlW^x-^UX h(C#$n?.#^ 
x->C{cot/>T, 1S^«^L. ^x-yCTb^ 

PB*^«:*c t * § J; d ^«3a<Di^— 9- x - i^tc^^-f J-T 

ilV^T. BjJ^^g^MS^Hl 0«. t#P>nfci£j— 9- 
:/^x->^, ±iELfca*!S{U^x-i/X(±'J>^!P 

X - > ^l^a -r ^ 7 ;!/ d" U X" A {c fcv ^ T BiB^ U /c ct 
5^^x->p^SiSfJ^a^^fflV^T7i'->l/^f 'J ^^'"L, 
jMJi^^nfc^— 9-y^x-i^^l9]W^x->'J X h C 

CO 1 7 7] :^;^lc. ^S^ft^^SasSB 1 Oti, 7.7- -J 

S7 3(c:*5V>T. ^x — >UX hC,,st®**'^^ 

CO 1 7 8] ^LT. HiltS^Tg^^aS^fi 1 Oti, X-r-y 
7"S 7 4{C*5l/^T. iIC0<fcd^M^LT</^'5^x — 

CO 1 7 9] <1C:T\ a^LTVS^x — >Ci, Cz*'* 

#ffiLJS:i/^«-a-tc:{±, fl*^^?S5aa^B 1 Oli, ^x- 

> u X h c , ,3tA''ef i^^iSi'^Miww^ X- 

CO 1 8 0] — tS', S^tLTt^i.^x — >Ci, CzA^I? 

ft-r^^-a-fcti. wsMK^i^ias^fi 1 o«. XT^-yT's 

7 57^SX-r-y :/S 7 8^ct5V'>T^ 2O£0^x— > 
C,, C2*^*i:$t>/-clO<0^«3W^x->*«ifig-r* 

CO 1 8 1] rtit>-h. mmm^9&mmmi at. xr- 

>yys 7 5^Cfcl/^T, 20(D^x— >C,. Cz^^t)-^: 
Ts Krfc^^WW^x-^CM^JKfiKI'So CCT. 5^ 
x-^CHtCfcttS-b^'.pO'h^Si. Sz. • • •. S 



[0 1 8 2] igtl^T, eWS^p^JaaSH 1 Oti. y 
(S,) . C (Sg) . • • •. C (S|cu|) tCtSI/^TC 



„^-9-:r^x->CM>. c„2 • • z^\z.'mt 

So iioiS*. efe^^i^^aasai oa, (i 4) 
(54) SIGNAL PROCESSING METHOD AND VIDEO SOUND PROCESSING DEVICE 
(57)Abstract: 

PROBLEM TO BE SOLVED: To extract high-level video structure from various kinds of video.
SOLUTION: A video sound processing device 10 is provided with a chain detection
part 16 and a chain analysis part 18. The chain detection part 16 uses feature quantities
extracted from the video segments and/or audio segments into which a stream of input
video data is divided, together with a measurement reference which is calculated for
each feature quantity from those feature quantities and measures the similarity between
video segments and/or audio segments. With these it detects, among the video segments
and/or audio segments, a similar chain consisting of a plurality of mutually similar
video and/or audio segments. The chain analysis part 18 analyzes the video using the
similar chain, and determines and outputs a local video structure and/or a global video
structure of the video.
CLAIMS 



[Claim(s)] 

[Claim 1]A signal processing method which detects and analyzes a pattern reflecting a
semantic structure of the contents of a supplied signal, the method comprising:
A characteristic quantity extraction process of extracting at least one characteristic
quantity representing a feature from each segment formed from a series of consecutive
frames constituting the signal.
A similarity measuring process of computing, for each characteristic quantity, metrics
which measure similarity between pairs of segments using that characteristic quantity,
and measuring similarity between pairs of segments by these metrics.
A detection process of detecting, using the characteristic quantities and the metrics, a
similar chain which comprises two or more mutually similar segments among the segments.
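The three processes of claim 1 can be illustrated with a minimal sketch. It assumes, for illustration only, that the characteristic quantity is a coarse 64-bin color histogram and that the metric is an L1 distance; the function names, the pixel representation and the threshold are hypothetical and are not taken from the claims.

```python
from itertools import combinations

def color_histogram(pixels, bits=2):
    """Characteristic quantity (assumed): normalized RGB histogram.

    Each 8-bit channel is reduced to its top `bits` bits, giving
    (2**bits)**3 bins, i.e. 64 bins for bits=2.
    """
    levels = 1 << bits
    hist = [0.0] * (levels ** 3)
    for r, g, b in pixels:
        idx = ((r >> (8 - bits)) * levels + (g >> (8 - bits))) * levels + (b >> (8 - bits))
        hist[idx] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]

def l1_distance(f, g):
    """Metric (assumed): L1 distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(f, g))

def similar_pairs(features, threshold):
    """Index pairs (i, j), i < j, whose dissimilarity is below the threshold."""
    return [(i, j) for i, j in combinations(range(len(features)), 2)
            if l1_distance(features[i], features[j]) < threshold]

# Three toy "segments", each represented by a handful of pixels.
segments = [[(255, 0, 0)] * 4, [(250, 5, 5)] * 4, [(0, 0, 255)] * 4]
features = [color_histogram(s) for s in segments]
pairs = similar_pairs(features, threshold=0.5)
```

Here the two reddish segments fall into the same histogram bin and form a similar pair, while the blue segment does not; the detection process would build chains from such pairs.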

[Claim 2]The signal processing method according to claim 1, further provided with an
analysis process which performs analysis using the similar chain, and determines and
outputs local structure and/or global structure of the signal.
[Claim 3]The signal processing method according to claim 1, wherein the signal is at
least one of a video signal and an audio signal in video data.

[Claim 4]The signal processing method according to claim 1, wherein the similar chain
is subject to restrictions on the relation between the similar segments which it
contains.

[Claim 5]The signal processing method according to claim 1, wherein the similar chain
is subject to restrictions on its structure.

[Claim 6]The signal processing method according to claim 4, wherein the similar chain
is a basic similar chain in which all the segments it contains are mutually similar.

[Claim 7]The signal processing method according to claim 4, wherein the similar chain
is a link similar chain in which, among all the segments it contains, adjoining
segments are mutually similar.
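The difference between the basic similar chain of claim 6 and the link similar chain of claim 7 can be shown with two predicates. This is a sketch under stated assumptions: `similar` is a hypothetical caller-supplied function that decides whether two segments are similar, and segments are represented here by plain numbers.

```python
from itertools import combinations

def is_basic_chain(chain, similar):
    """True if every pair of segments in the chain is mutually similar."""
    return all(similar(a, b) for a, b in combinations(chain, 2))

def is_link_chain(chain, similar):
    """True if every pair of adjoining segments in the chain is similar."""
    return all(similar(a, b) for a, b in zip(chain, chain[1:]))

# Toy similarity (assumed): scalar features that differ by at most 1.
def close(a, b):
    return abs(a - b) <= 1

drifting = [0, 1, 2]   # neighbours are similar, but the ends are not
tight = [0, 1, 1]      # every pair is similar
```

Every basic chain is also a link chain, but not the reverse: a link chain may drift gradually, so that its first and last segments are no longer similar.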

[Claim 8]The signal processing method according to claim 4, wherein the similar chain
is a periodic chain in which, among all the segments it contains, each segment is
mutually similar to the segment located a predetermined number of segments after it.
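The periodic chain of claim 8 can be checked in a few lines. This is a minimal sketch in which the parameter `period` stands for the predetermined number of segments and `similar` is a hypothetical caller-supplied predicate; neither name comes from the claims.

```python
def is_periodic_chain(chain, similar, period):
    """True if each segment is similar to the one `period` positions later.

    Chains too short to contain even one such pair are rejected.
    """
    if period < 1 or len(chain) <= period:
        return False
    return all(similar(chain[k], chain[k + period])
               for k in range(len(chain) - period))

# Toy example: segment labels alternate between two kinds of shot.
labels = ["A", "B", "A", "B", "A"]
same = lambda a, b: a == b
```

With these labels the chain is periodic with period 2 (A repeats every second segment, and so does B) but not with period 1.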

[Claim 9]The signal processing method according to claim 5, wherein the similar chain
is a partial chain in which, among all the segments it contains, the time interval
between each pair of adjoining segments is shorter than a predetermined time.

[Claim 10]The signal processing method according to claim 5, wherein the similar chain
is a uniform chain in which the segments it contains appear at approximately equal
time intervals.
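The two structural restrictions of claims 9 and 10 can be sketched from segment start times. The spread measure below is an illustrative assumption and is not the uniformity formula of the description; `max_gap` and `tolerance` are hypothetical parameters, and start times are assumed to be strictly increasing.

```python
from statistics import mean

def gaps(start_times):
    """Time intervals between adjoining segments of a chain."""
    return [b - a for a, b in zip(start_times, start_times[1:])]

def is_partial_chain(start_times, max_gap):
    """True if every adjoining pair is closer in time than max_gap."""
    g = gaps(start_times)
    return len(g) > 0 and all(x < max_gap for x in g)

def gap_spread(start_times):
    """Illustrative unevenness measure: 0 for perfectly even spacing."""
    g = gaps(start_times)
    return (max(g) - min(g)) / mean(g)

def is_uniform_chain(start_times, tolerance):
    """True if the gaps are approximately equal (spread within tolerance)."""
    return len(start_times) >= 3 and gap_spread(start_times) <= tolerance
```

For example, start times of 0, 10, 21 and 30 seconds give gaps of 10, 11 and 9, which pass a tolerance of 0.25, whereas 0, 10 and 50 seconds do not.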

[Claim 11]The signal processing method according to claim 6, wherein the detection
process comprises:
A candidate chain detection process of detecting mutually similar segments using the
characteristic quantities and the metrics, and grouping them to form candidate chains.
A filtering process of computing, for each candidate chain, quality metrics which
correspond to a numerical standard and measure the importance and relevance of the
candidate chain in structural-pattern analysis of the signal, and outputting only the
candidate chains whose quality metrics exceed a predetermined quality metrics threshold.
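The two-stage detection of claim 11 can be sketched as a greedy grouping followed by a threshold filter. The grouping strategy and the use of chain length as the quality metric are assumptions made for illustration; the claim does not fix either, and `similar` and `quality` are hypothetical caller-supplied functions.

```python
def candidate_chains(features, similar):
    """Greedily group segment indices into basic similar chains.

    A segment joins the first chain whose members are all similar to it;
    otherwise it starts a new chain.
    """
    chains = []
    for idx, f in enumerate(features):
        for chain in chains:
            if all(similar(f, features[j]) for j in chain):
                chain.append(idx)
                break
        else:
            chains.append([idx])
    return chains

def filter_chains(chains, quality, threshold):
    """Keep only the chains whose quality metric exceeds the threshold."""
    return [c for c in chains if quality(c) > threshold]

# Toy scalar features; two values are similar if they differ by < 0.5.
features = [0.0, 5.0, 0.1, 5.2, 9.0, 0.05]
near = lambda a, b: abs(a - b) < 0.5
candidates = candidate_chains(features, near)
kept = filter_chains(candidates, quality=len, threshold=1)
```

The isolated segment at index 4 forms a one-element candidate chain and is dropped by the filter, while the two repeating groups are kept.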

[Claim 12]The signal processing method according to claim 2, wherein the segments in
the signal are processed sequentially, one segment at a time, in the temporal order in
which the segments are supplied.
[Claim 13]The signal processing method according to claim 12, wherein the detection
process comprises:
A candidate chain detection process of finding, for the segment being processed, the
candidate chain which contains that segment, using the characteristic quantities and
the metrics for that segment, and updating the candidate chain as required.
A filtering process of computing, for each candidate chain, quality metrics which
correspond to a numerical standard and measure the importance and relevance of the
candidate chain in structural-pattern analysis of the signal, and outputting only the
candidate chains whose quality metrics exceed a predetermined quality metrics threshold.
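The sequential variant of claims 12 and 13 can be sketched as an incremental update: each arriving segment either extends the closest existing candidate chain or starts a new one. Taking the chain-to-segment dissimilarity as the minimum distance to any member is an assumption, as are the class name, `distance` and `threshold`.

```python
class IncrementalChainDetector:
    """Assigns segments to candidate chains one at a time, in arrival order."""

    def __init__(self, distance, threshold):
        self.distance = distance      # dissimilarity between two features
        self.threshold = threshold    # maximum dissimilarity to join a chain
        self.chains = []              # each chain: list of (index, feature)
        self._next_index = 0

    def add(self, feature):
        idx = self._next_index
        self._next_index += 1
        best, best_d = None, None
        for chain in self.chains:
            # Assumed chain-to-segment dissimilarity: closest member.
            d = min(self.distance(feature, f) for _, f in chain)
            if best_d is None or d < best_d:
                best, best_d = chain, d
        if best is not None and best_d < self.threshold:
            best.append((idx, feature))            # update existing chain
        else:
            self.chains.append([(idx, feature)])   # start a new chain

    def chain_indices(self):
        return [[i for i, _ in chain] for chain in self.chains]

detector = IncrementalChainDetector(lambda a, b: abs(a - b), threshold=0.5)
for value in [0.0, 5.0, 0.2, 5.1, 0.3]:
    detector.add(value)
```

Because each segment is handled once on arrival, this style of update does not need the whole signal in advance; a filtering step on the resulting chains would follow as in the claim.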

[Claim 14]The signal processing method according to claim 8, wherein the detection
process comprises:
An initial periodic chain detection process of finding initial candidates for periodic
chains.
A duplication chain detection process of finding, among the initial candidates for
periodic chains, duplication chains which cross one another in time.
A consistency process of determining the consistency of the duplication chains.

[Claim 15]The signal processing method according to claim 2, wherein the analysis
process uses the similar chain to detect and output, as local structure of the signal,
a scene which is a meaning-based subset of segments.

[Claim 16]The signal processing method according to claim 2, wherein the analysis
process uses the similar chain to detect and output, as global structure of the signal,
structural patterns in which mutually similar segments occur repetitively.
[Claim 17]The signal processing method according to claim 16, wherein a news item in a
newscast is detected and output as the structural patterns.
[Claim 18]The signal processing method according to claim 16, wherein video structure
in a sportscast in which plays occur repetitively is detected and output as the
structural patterns.

[Claim 19]The signal processing method according to claim 2, wherein the analysis
process uses the similar chain to detect and output topic structure which groups
related scenes, each scene being a meaning-based subset of segments.
[Claim 20]A video voice processing unit which detects and analyzes patterns of images
and/or sounds reflecting the semantic structure of the contents of a supplied video
signal, comprising:

A feature amount extracting means which extracts at least one characteristic
quantity representing the features of an image and/or sound segment formed from a
series of continuous image and/or audio frames constituting the above-mentioned
video signal.

A similarity measuring means which, using the above-mentioned characteristic
quantity, computes for each characteristic quantity metrics that measure similarity
between pairs of the above-mentioned image and/or sound segments, and measures
similarity between pairs of the above-mentioned image and/or sound segments by
these metrics.
A detection means which, using the above-mentioned characteristic quantity and the
above-mentioned metrics, detects a similar chain comprising two or more mutually
similar image and/or sound segments among the above-mentioned image and/or sound
segments.

[Claim 21]The video voice processing unit according to claim 20, further provided
with an analysis means which uses the above-mentioned similar chain to determine and
output the local video structure and/or global video structure of the
above-mentioned video signal.

[Claim 22]The video voice processing unit according to claim 20, wherein the
above-mentioned similar chain has restrictions on the relation between the similar
image and/or sound segments which the similar chain concerned contains.

[Claim 23]The video voice processing unit according to claim 20, wherein the
above-mentioned similar chain has restrictions on the structure of the similar chain
concerned.

[Claim 24]The video voice processing unit according to claim 22, wherein the 
above-mentioned similar chain is a basic similar chain where all the images and/or 
sound segments which the similar chain concerned contains have a mutually similar 
relation. 

[Claim 25]The video voice processing unit according to claim 22, wherein the
above-mentioned similar chain is a link similar chain in which every pair of adjoining
image and/or sound segments contained in the similar chain concerned has a mutually
similar relation.

[Claim 26]The video voice processing unit according to claim 22, wherein the
above-mentioned similar chain is a periodic chain in which, among all the image
and/or sound segments which the similar chain concerned contains, each image and/or
sound segment has a mutually similar relation with the image and/or sound segment
located a predetermined number of positions after it.

[Claim 27]The video voice processing unit according to claim 23, wherein the
above-mentioned similar chain is a partial chain in which the time interval between
each pair of adjoining image and/or sound segments contained in the similar chain
concerned is shorter than a predetermined time.
[Claim 28]The video voice processing unit according to claim 23, wherein the
above-mentioned similar chain is a uniform chain in which the image and/or sound
segments contained in the similar chain concerned appear at approximately equal
time intervals.

[Claim 29]The video voice processing unit according to claim 24, wherein the
above-mentioned detection means detects and groups mutually similar image and/or
sound segments using the above-mentioned characteristic quantity and the
above-mentioned metrics to form candidate chains, computes for each of the
above-mentioned candidate chains quality metrics corresponding to a numerical
standard that measure the importance and relevance of the candidate chain in
structural-pattern analysis of the above-mentioned video signal, and outputs only
those candidate chains whose quality metrics exceed a predetermined quality metrics
threshold.

[Claim 30]The video voice processing unit according to claim 21, wherein the image
and/or sound segments in the above-mentioned video signal are processed
sequentially, one segment at a time, in the time order in which they were supplied.

[Claim 31]The video voice processing unit according to claim 30, wherein the
above-mentioned detection means uses the above-mentioned characteristic quantity
and the above-mentioned metrics of the current target image and/or sound segment to
find, and update as required, the candidate chains containing that image and/or
sound segment, computes for each of the above-mentioned candidate chains quality
metrics corresponding to a numerical standard that measure the importance and
relevance of the candidate chain in structural-pattern analysis of the
above-mentioned video signal, and outputs only those candidate chains whose quality
metrics exceed a predetermined quality metrics threshold.

[Claim 32]The video voice processing unit according to claim 26, wherein the
above-mentioned detection means finds initial candidates of a periodic chain, finds
among the above-mentioned initial candidates duplication chains that overlap in
time, and checks the consistency of the above-mentioned duplication chains.



[Claim 33]The video voice processing unit according to claim 21, wherein the
above-mentioned analysis means uses the above-mentioned similar chain to detect and
output, as local video structure of the above-mentioned video signal, a scene which
is a semantically based group of image and/or sound segments.
[Claim 34]The video voice processing unit according to claim 21, wherein the
above-mentioned analysis means uses the above-mentioned similar chain to detect and
output, as global video structure of the above-mentioned video signal, structural
patterns in which mutually similar image and/or sound segments occur repetitively.

[Claim 35]The video voice processing unit according to claim 34, wherein the 
above-mentioned analysis means detects and outputs a news item in newscasting as 
the above-mentioned structural patterns. 

[Claim 36]The video voice processing unit according to claim 34, wherein the
above-mentioned analysis means detects and outputs, as the above-mentioned
structural patterns, video structure in which plays occur repetitively in a sports
broadcast.
[Claim 37]The video voice processing unit according to claim 21, wherein the
above-mentioned analysis means uses the above-mentioned similar chain to detect and
output topic structure which groups related scenes, each scene being a semantically
based group of image and/or sound segments.



DETAILED DESCRIPTION 



[Detailed Description of the Invention] 
[0001] 

[Field of the Invention]This invention relates to a signal processing method which
detects and analyzes patterns reflecting the underlying semantic structure of a
signal, and to a video voice processing unit which detects and analyzes patterns of
images and/or sounds reflecting the underlying semantic structure of a video signal.
[0002] 

[Description of the Prior Art]For example, a user may wish to search for and play
back a desired portion, such as a portion of interest, from a video application
composed of a large amount of varied image data, such as a TV program recorded as
video data.

[0003]A common technique for extracting desired image contents in this way is the
storyboard, a panel created by arranging a series of images that depict the major
scenes of an application. A storyboard decomposes the video data into what are
called shots and displays an image representing each shot. Most such image
extraction techniques automatically detect and extract shots from the video data,
as described for example in "G. Ahanger and T.D.C. Little, A survey of technologies
for parsing and indexing digital video, J. of Visual Communication and Image
Representation 7:28-4, 1996."
[0004] 

[Problem(s) to be Solved by the Invention] However, a typical 30-minute TV program,
for example, contains hundreds of shots. With the conventional image extraction
technique mentioned above, the user therefore has to examine a storyboard in which
a huge number of extracted shots are arranged, and understanding such a storyboard
places a heavy burden on the user. The conventional image extraction technique also
has the problem that many shots are redundant, for example the shots in a
conversation scene in which two persons are filmed alternately as the speaker
changes. Thus the shot is too low in the hierarchy to serve as the unit for
extracting video structure, much of the information is wasted, and a conventional
image extraction technique that extracts such shots cannot be said to be convenient
for the user.
[0005]Other image extraction techniques use highly specialized knowledge about a
specific content genre, such as news or football games, as described for example in
"A. Merlino, D. Morey and M. Maybury, Broadcast news navigation using story
segmentation, Proc. of ACM Multimedia 97, 1997" or JP,10-136297,A. Although such a
conventional image extraction technique can obtain good results for the target
genre, it is of no use for other genres, and because it is limited to one genre it
has the problem that it cannot easily be generalized.
[0006]Other image extraction techniques extract what are called story units, as
described for example in U.S. Patent No. 5,708,767. However, this conventional image
extraction technique is not fully automated and requires user intervention to
determine which shots show the same contents. It also has the problems that the
calculation required for processing is complicated and that it can be applied only
to video information.

[0007]Still other image extraction techniques identify shots by combining shot
detection with silent part detection, as described for example in JP,9-214879,A.
However, this conventional image extraction technique is effective only when a
silent part corresponds to a shot boundary.
[0008]Other image extraction techniques detect repeated similar shots in order to
reduce the redundancy of the display in a storyboard, as described for example in
"H. Aoki, S. Shimotsuji and O. Hori, A shot classification method to select
effective key-frames for video browsing, IPSJ Human Interface SIG Notes, 7:43-50,
1996" or JP,9-93588,A. However, this conventional image extraction technique can be
applied only to video information, and cannot be applied to speech information.



[0009]Moreover, such image extraction techniques could detect only what is called
local video structure, or global video structure based on special knowledge.
[0010]This invention is made in view of these circumstances. Its object is to solve
the problems of the conventional image extraction techniques mentioned above and to
provide a signal processing method and a video voice processing unit which extract
high-level video structure from various kinds of video data.
[0011] 

[Means for Solving the Problem]A signal processing method concerning this invention
which attains the purpose mentioned above is a signal processing method which
detects and analyzes patterns reflecting the semantic structure of the contents of
a supplied signal. It is characterized by having a characteristic quantity
extraction process which extracts at least one characteristic quantity representing
the features of a segment formed from a series of continuous frames constituting
the signal, a similarity measuring process which uses the characteristic quantity
to compute, for each characteristic quantity, metrics that measure similarity
between pairs of segments and measures similarity between pairs of segments by
these metrics, and a detection process which uses the characteristic quantity and
the metrics to detect a similar chain comprising two or more mutually similar
segments among the segments.

[0012]Such a signal processing method concerning this invention detects the
fundamental structural patterns of similar segments in a signal.

[0013]A video voice processing unit concerning this invention which attains the
purpose mentioned above is a video voice processing unit which detects and analyzes
patterns of images and/or sounds reflecting the semantic structure of the contents
of a supplied video signal. It is characterized by having a feature amount
extracting means which extracts at least one characteristic quantity representing
the features of an image and/or sound segment formed from a series of continuous
image and/or audio frames constituting the video signal, a similarity measuring
means which uses the characteristic quantity to compute, for each characteristic
quantity, metrics that measure similarity between pairs of image and/or sound
segments and measures similarity between pairs of image and/or sound segments by
these metrics, and a detection means which uses the characteristic quantity and the
metrics to detect a similar chain comprising two or more mutually similar image
and/or sound segments among the image and/or sound segments.
[0014]Such a video voice processing unit concerning this invention determines and
outputs the fundamental structural patterns of similar image and/or sound segments
in a video signal.
[0015] 

[Embodiment of the Invention]Hereafter, a concrete embodiment to which this
invention is applied is explained in detail with reference to the drawings.



[0016]The embodiment to which this invention is applied is a video voice processing
unit which automatically finds and extracts desired contents from recorded video
data. In particular, this video voice processing unit detects and analyzes the
structural patterns of images and/or sounds reflecting the underlying semantic
structure of video data, and introduces the concept of a similar chain (hereafter
written simply as a chain where appropriate) in order to conduct this analysis.
Before this video voice processing unit is explained concretely, the video data
targeted by this invention is explained first.
[0017]In this invention, the target video data is modeled, as shown in drawing 1, as
having a structure of frames, segments, and similar chains. That is, at the lowest
layer, video data consists of a series of frames. In the layer one above the
frames, video data consists of segments, each formed from a series of continuous
frames. Furthermore, in video data, a series of segments which share a specific
kind of similarity pattern constitutes a similar chain.
[0018]This video data includes both image and sound information. That is, the frames
in this video data include video frames, each of which is a single still picture,
and audio frames, each of which represents speech information generally sampled over
a short time, such as tens to hundreds of milliseconds in length.
[0019]A segment consists of a series of video frames shot continuously with a single
camera, and is generally called a shot. Segments include video segments and sound
segments, and serve as the basic unit of video structure. Among these segments, many
definitions are possible for the sound segment in particular, and the following can
be considered as examples. First, a sound segment may be formed by setting its
boundaries at silent periods in the video data detected by a generally well-known
method. Alternatively, as described in "D. Kimber and L. Wilcox, Acoustic
Segmentation for Audio Browsers, Xerox PARC Technical Report", a sound segment may
be formed from a series of audio frames classified into a small number of categories
such as voice, music, noise, and silence. Further, as described in "S. Pfeiffer, S.
Fischer and E. Wolfgang, Automatic Audio Content Analysis, Proceeding of ACM
Multimedia 96, Nov. 1996, pp21-30", a large change in a certain feature between two
consecutive audio frames may be detected as a voice cut point, and a sound segment
may be determined based on this voice cut point.
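
The first definition above (boundaries at silent periods) can be illustrated with a
minimal Python sketch. The per-frame level values, the threshold, and the function
name split_on_silence are assumptions introduced here for illustration only; the
specification does not prescribe any particular silence detector.

```python
from typing import List, Tuple


def split_on_silence(levels: List[float], silence_threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) audio-frame ranges, end exclusive, of non-silent runs.

    `levels` holds one loudness value per audio frame (hypothetical input).
    A frame counts as silent when its level is at or below the threshold.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for i, level in enumerate(levels):
        if level > silence_threshold:
            if start is None:
                start = i  # a new sound segment begins here
        elif start is not None:
            segments.append((start, i))  # a silent frame closes the segment
            start = None
    if start is not None:
        segments.append((start, len(levels)))  # stream ended inside a segment
    return segments


frame_levels = [0.0, 0.6, 0.7, 0.0, 0.0, 0.5, 0.4, 0.3, 0.0]
audio_segments = split_on_silence(frame_levels, silence_threshold=0.1)
```

With these toy levels the sketch yields two sound segments, frames 1 to 2 and
frames 5 to 7.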
[0020]In such video data, a similar chain is a set of two or more mutually similar
segments arranged in time order. Its structural patterns are classified into several
kinds according to the constraints to be satisfied by the relation between the
similar segments contained in the chain and by the structure of the chain. Formally,
when the segments which a similar chain contains are written S_{i_1}, ..., S_{i_k},
the similar chain is a series of segments for which i_j < i_{j+1} holds for all
j = 1, ..., k-1. Here the index i_j is the segment number of the segment in the
original video data, and the subscript j on i means that the segment is the j-th on
the time axis within the similar chain. Since a similar chain may contain segments
that are discontinuous in time, time gaps may exist between the elements of a chain.
In other words, segments S_{i_j} and S_{i_{j+1}} are not necessarily contiguous in
the original video data.
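
As a hedged illustration of this definition, a chain can be held as a list of
segment numbers in the original video data. The helper names is_valid_chain and
time_gaps are hypothetical and are not taken from the specification.

```python
from typing import List, Sequence


def is_valid_chain(indices: Sequence[int]) -> bool:
    """A chain needs two or more segments with strictly increasing segment numbers."""
    return len(indices) >= 2 and all(a < b for a, b in zip(indices, indices[1:]))


def time_gaps(indices: Sequence[int]) -> List[int]:
    """Count the segments skipped between consecutive chain elements."""
    return [b - a - 1 for a, b in zip(indices, indices[1:])]


chain = [0, 2, 4, 9]          # i_1 < i_2 < i_3 < i_4
gaps = time_gaps(chain)       # non-zero entries are time gaps in the chain
```

Here every gap is non-zero, which shows that consecutive chain elements need not be
adjacent in the original video data.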

[0021]By using similar chains, important clues about both the local video structure
and the global video structure described later can be obtained from video data.
Video data generally contains clues from which a viewer can perceptually grasp its
outline. The simplest and most important of these clues are the structural patterns
of similar video segments or sound segments, and these structural patterns are
exactly the information that a similar chain is meant to capture.

[0022]Such similar chains include the basic similar chain, the link similar chain,
the partial chain, and the periodic chain. As explained in detail later, these are
the most important and fundamental chains in video data analysis.
[0023]In a basic similar chain, all the segments which the chain contains are
mutually similar, but there are no restrictions on its structural pattern. Such a
basic similar chain can generally be obtained using a grouping algorithm or
clustering algorithm that groups segments. In a link similar chain, segments that
adjoin each other in the chain are mutually similar. In a partial chain, the time
interval between each pair of adjoining segments is smaller than a predetermined
time. In a periodic chain, each segment is similar to the segment m positions after
it. That is, a periodic chain consists of approximate repetitions of m segments.
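
The four constraints can be written as simple predicates. The following Python
sketch is illustrative only: the similarity predicate, the use of segment start
times for the partial chain, and all function names are assumptions, not part of
the specification.

```python
from itertools import combinations
from typing import Callable, Sequence

Similar = Callable[[int, int], bool]  # hypothetical pairwise similarity test


def is_basic_chain(chain: Sequence[int], similar: Similar) -> bool:
    """Every pair of segments in the chain is mutually similar."""
    return all(similar(a, b) for a, b in combinations(chain, 2))


def is_link_chain(chain: Sequence[int], similar: Similar) -> bool:
    """Each segment is similar to its neighbour in the chain."""
    return all(similar(a, b) for a, b in zip(chain, chain[1:]))


def is_partial_chain(chain: Sequence[int], start_time: Sequence[float], max_gap: float) -> bool:
    """Neighbouring chain segments start less than max_gap apart (assumed measure)."""
    return all(start_time[b] - start_time[a] < max_gap for a, b in zip(chain, chain[1:]))


def is_periodic_chain(chain: Sequence[int], similar: Similar, m: int) -> bool:
    """Each segment is similar to the one m positions later in the chain."""
    return all(similar(chain[j], chain[j + m]) for j in range(len(chain) - m))


labels = "ABABAB"  # toy content label per segment


def same_label(a: int, b: int) -> bool:
    return labels[a] == labels[b]


start_time = [0.0, 5.0, 10.0, 15.0, 20.0, 25.0]
```

In this toy data the segments 0, 2, 4 form a basic similar chain, while the full
sequence 0 to 5 is periodic with m = 2 but is not a link similar chain.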

[0024]As shown below, such similar chains can be used to extract from video data
both local video structure, such as scenes, and global video structure, such as
news items.

[0025]In order to describe video data as higher-level scenes based on its semantic
content, the segments obtained by video segment (shot) detection or sound segment
detection are grouped into meaningful units using characteristic quantities that
represent the features of a segment, such as the amount of perceptual activity in
the segment. Although a scene is subjective and depends on the contents or genre of
the video data, a scene is taken here to be a group of repetitive patterns of video
segments or sound segments whose characteristic quantities show mutual similarity.
[0026]Now, as an example of a similar chain used to extract the local video
structure mentioned above, consider, as shown in drawing 2, a scene in which two
speakers are talking to each other and the video segments alternate according to
the speaker. In video data which has such a repetitive pattern, the video segments
form two interleaved chains, one for component A and one for component B. Therefore,
such interleaved partial chains can generally be used to detect related groups of
video segments, that is, scenes.
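
One possible way to turn interleaved chains into scene candidates is sketched below
in Python. The overlap test and the merging rule are assumptions chosen for
illustration; the specification does not state how crossing chains are combined.

```python
from typing import List, Sequence


def chains_cross(a: Sequence[int], b: Sequence[int]) -> bool:
    """True when the segment-number ranges of two chains overlap in time."""
    return a[0] < b[-1] and b[0] < a[-1]


def group_crossing(chains: List[List[int]]) -> List[List[int]]:
    """Merge chains whose ranges overlap into sorted groups (scene candidates)."""
    groups: List[List[int]] = []
    for chain in sorted(chains, key=lambda c: c[0]):
        # groups[-1] is kept sorted, so its last element is its latest segment
        if groups and chain[0] < groups[-1][-1]:
            groups[-1] = sorted(set(groups[-1]) | set(chain))
        else:
            groups.append(sorted(chain))
    return groups


speaker_a = [0, 2, 4]   # segments showing speaker A
speaker_b = [1, 3, 5]   # segments showing speaker B
unrelated = [8, 10]     # a later chain that does not cross the dialogue
scenes = group_crossing([speaker_a, speaker_b, unrelated])
```

The two speaker chains merge into one group covering segments 0 to 5, and the later
chain stays separate.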



[0027]As an example of a similar chain used to extract the global video structure
mentioned above, consider, as shown in drawing 3, a news program which has a fixed
structure. In such video data, for each news item a segment in which the newscaster
introduces the item appears first, followed by a segment in which a correspondent
reports, for example from the scene. In video data which has such a fixed
structure, the video segments of the repeatedly appearing newscaster constitute a
global chain. Since a newscaster segment marks the start of each news item, news
items can be detected automatically by using the global chain. That is, in the
figure, each topic can be detected by using the global chain from video data which
comprises two or more news items, namely topics A, B, C, D, and so on.
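
A minimal Python sketch of this idea follows, assuming the newscaster chain has
already been detected and is given as segment numbers. The function name
split_into_items is hypothetical, and segments before the first newscaster segment
are simply ignored in this sketch.

```python
from typing import List, Sequence


def split_into_items(anchor_chain: Sequence[int], num_segments: int) -> List[List[int]]:
    """Cut the segment sequence at each newscaster segment.

    Each returned list runs from one newscaster segment up to, but not
    including, the next one (or the end of the video data).
    """
    bounds = list(anchor_chain) + [num_segments]
    return [list(range(start, end)) for start, end in zip(bounds, bounds[1:])]


anchor_chain = [0, 3, 7]   # segments in which the newscaster appears
items = split_into_items(anchor_chain, num_segments=10)
```

Each resulting group starts with a newscaster segment, matching the fixed structure
described above.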

[0028]The video voice processing unit 10 shown in drawing 4 as an embodiment to
which this invention is applied measures the similarity between segments using the
characteristic quantities of the segments in the video data mentioned above and
automatically detects the similar chains mentioned above. It can be applied to both
video segments and sound segments. By analyzing the similar chains, the video voice
processing unit 10 can extract and reconstruct high-level structure from video
data, such as scenes, which are local video structure, and topics, which are global
video structure.

[0029]As shown in the figure, the video voice processing unit 10 is provided with
the following.

The video dividing part 11 which divides the stream of the inputted video data into
segments of images, sounds, or both.

The video segment memory 12 which memorizes the partition information of the video
data.

The image feature quantity extracting part 13 which is a feature amount extracting
means which extracts the characteristic quantity of each video segment.

The voice feature amount extraction part 14 which is a feature amount extracting
means which extracts the characteristic quantity of each sound segment.

The segment characteristic quantity memory 15 which memorizes the characteristic
quantities of video segments and sound segments.

The chain primary detecting element 16 which is a detection means which groups
video segments and sound segments into chains.

The characteristic quantity similarity test section 17 which is a similarity
measuring means which measures the similarity between two segments.

The chain analyzing parts 18 which are analysis means which detect various video
structures.

[0030]The video dividing part 11 receives a stream of video data consisting of image
data and voice data in any of various digitized formats, including compressed video
data formats such as MPEG1 (Moving Picture Experts Group phase 1), MPEG2 (Moving
Picture Experts Group phase 2), and what is called DV (Digital Video), and divides
this video data into segments of images, sounds, or both. When the inputted video
data is in a compressed format, the video dividing part 11 can process it directly,
without fully decompressing it. The video dividing part 11 processes the inputted
video data and divides it into video segments and sound segments. The video
dividing part 11 supplies the partition information, which is the result of
dividing the inputted video data, to the video segment memory 12 in the subsequent
stage. The video dividing part 11 also supplies the partition information to the
image feature quantity extracting part 13 and the voice feature amount extraction
part 14 in the subsequent stage, according to whether the segments are video
segments or sound segments.

[0031]The video segment memory 12 memorizes the partition information of the video
data supplied from the video dividing part 11. The video segment memory 12 supplies
partition information to the chain primary detecting element 16 in response to
inquiries from the chain primary detecting element 16 mentioned later.
[0032]The image feature quantity extracting part 13 extracts the characteristic
quantity of each video segment obtained when the video dividing part 11 divides the
video data. The image feature quantity extracting part 13 can process compressed
video data directly, without fully decompressing it. The image feature quantity
extracting part 13 supplies the extracted characteristic quantity of each video
segment to the segment characteristic quantity memory 15 in the subsequent stage.
[0033]The voice feature amount extraction part 14 extracts the characteristic
quantity of each sound segment obtained when the video dividing part 11 divides the
video data. The voice feature amount extraction part 14 can process compressed
audio data directly, without fully decompressing it. The voice feature amount
extraction part 14 supplies the extracted characteristic quantity of each sound
segment to the segment characteristic quantity memory 15 in the subsequent stage.
[0034]The segment characteristic quantity memory 15 memorizes the characteristic
quantities of the video segments and sound segments supplied from the image feature
quantity extracting part 13 and the voice feature amount extraction part 14,
respectively. The segment characteristic quantity memory 15 supplies the memorized
characteristic quantities and segments to the characteristic quantity similarity
test section 17 in response to inquiries from the characteristic quantity
similarity test section 17 mentioned later.

[0035]The chain primary detecting element 16 groups video segments and sound
segments into chains using the partition information held in the video segment
memory 12 and the similarity between pairs of segments. Starting from each segment
in a group, the chain primary detecting element 16 detects repetitive patterns of
similar segments within the segment group and groups such segments into a chain.
After assembling the initial chain candidates, the chain primary detecting element
16 determines the final set of chains in a second, filtering phase. The chain
primary detecting element 16 then supplies the detected chains to the chain
analyzing parts 18 in the subsequent stage.
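
The two phases (forming candidates, then filtering them) can be sketched in Python
as follows. The greedy grouping rule, the use of chain length as the quality
metric, and both function names are assumptions made for illustration; the
specification leaves the concrete algorithm and quality metric open at this point.

```python
from typing import Callable, List, Sequence


def detect_candidates(
    features: Sequence[float],
    dissimilarity: Callable[[float, float], float],
    max_dissimilarity: float,
) -> List[List[int]]:
    """Phase 1: greedily group each segment with the first chain it fully matches."""
    candidates: List[List[int]] = []
    for idx, feat in enumerate(features):
        for chain in candidates:
            if all(dissimilarity(feat, features[j]) <= max_dissimilarity for j in chain):
                chain.append(idx)
                break
        else:
            candidates.append([idx])  # no matching chain, so start a new one
    return candidates


def filter_chains(
    candidates: List[List[int]],
    quality: Callable[[List[int]], float],
    min_quality: float,
) -> List[List[int]]:
    """Phase 2: keep only candidates whose quality metric exceeds the threshold."""
    return [c for c in candidates if quality(c) > min_quality]


features = [0.10, 0.90, 0.12, 0.88, 0.50, 0.11]   # one toy feature per segment
candidates = detect_candidates(features, lambda a, b: abs(a - b), 0.05)
final_chains = filter_chains(candidates, quality=len, min_quality=1)
```

In this toy run the isolated segment 4 forms a candidate of length one, which the
filtering phase discards.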

[0036]The characteristic quantity similarity test section 17 measures the similarity
between two segments. To do so, the characteristic quantity similarity test section
17 queries the segment characteristic quantity memory 15 to retrieve the
characteristic quantity of a given segment.

[0037]The chain analyzing parts 18 analyze the chain structure detected by the chain
primary detecting element 16, and detect various local video structures and global
video structures. As described later, the details of these chain analyzing parts 18
can be adjusted to suit the specific application.

[0038]Such a video voice processing unit 10 detects video structure using similar
chains by performing the series of processings outlined in drawing 5.
[0039]First, as shown in the figure, the video voice processing unit 10 performs
video division in Step S1. That is, the video voice processing unit 10 divides the
video data inputted into the video dividing part 11 into either video segments or
sound segments or, if possible, into both. The video voice processing unit 10
imposes no particular prerequisite on the video division method to be applied. For
example, the video voice processing unit 10 performs video division by a method
such as the one described in "G. Ahanger and T.D.C. Little, A survey of
technologies for parsing and indexing digital video, J. of Visual Communication and
Image Representation 7:28-4, 1996." Such video division methods are well known in
the technical field concerned, and the video voice processing unit 10 may apply any
video division method.

[0040]Then, the video voice processing unit 10 extracts characteristic quantities in
Step S2. That is, the video voice processing unit 10 calculates the characteristic
quantities representing the features of each segment by the image feature quantity
extracting part 13 or the voice feature amount extraction part 14. In the video
voice processing unit 10, the applicable characteristic quantities that are
calculated include, for example, image characteristic quantities such as the time
length, color histogram, and texture features of each segment, voice feature
amounts such as a frequency analysis result, level, and pitch, and an activity
measurement result. Of course, the characteristic quantities applicable to the
video voice processing unit 10 are not limited to these.
[0041]Then, the video voice processing unit 10 measures the similarity of segments
using the characteristic quantities in Step S3. That is, the video voice processing
unit 10 performs dissimilarity measurement by the characteristic quantity
similarity test section 17, and measures by the resulting metrics how similar two
segments are. The video voice processing unit 10 calculates the dissimilarity
metrics using the characteristic quantities extracted in the previous Step S2.
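
As one hedged example of such a dissimilarity metric, the Python sketch below
compares two histograms (for example color histograms) by the L1 distance between
their normalized forms. The choice of L1 distance and the function names are
assumptions for illustration; the passage above does not fix a particular metric.

```python
from typing import List, Sequence


def normalize(hist: Sequence[float]) -> List[float]:
    """Scale a histogram so that its bins sum to 1."""
    total = sum(hist)
    if total == 0:
        raise ValueError("histogram must contain at least one count")
    return [h / total for h in hist]


def l1_dissimilarity(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Sum of absolute bin differences: 0 for identical shapes, 2 for disjoint ones."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    return sum(abs(a - b) for a, b in zip(normalize(h1), normalize(h2)))


same_shape = l1_dissimilarity([1, 2, 3], [2, 4, 6])   # same distribution, different size
disjoint = l1_dissimilarity([1, 0], [0, 1])           # no bins in common
```

Normalizing first makes the measure independent of segment length, so that a short
and a long segment with the same color distribution are judged similar.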
[0042]Then, the video voice processing unit 10 detects chains in Step S4. That is,
the video voice processing unit 10 detects chains of similar segments using the
dissimilarity metrics calculated in the previous Step S3 and the characteristic
quantities extracted in the previous Step S2.

[0043]And the video voice processing unit 10 analyzes a chain in Step S5. That is, the 
video voice processing unit 10 determines and outputs the local video structure 
and/or global video structure of a video data using the chain detected in previous step 
S4. 
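
The flow of Steps S1 to S5 can be summarized as a small Python sketch in which each
stage is passed in as a function. All stage implementations below are toy stand-ins
(a character string plays the role of the video stream), introduced only to show
how the data moves between steps; none of the names come from the specification.

```python
from itertools import groupby
from typing import Callable, List


def run_pipeline(stream, divide, extract, dissimilarity, detect, analyze_chains):
    """Chain the five steps: divide, extract, measure, detect, analyze."""
    segments = divide(stream)                          # Step S1: video division
    features = [extract(seg) for seg in segments]      # Step S2: feature extraction

    def measure(i: int, j: int) -> float:              # Step S3: dissimilarity
        return dissimilarity(features[i], features[j])

    chains = detect(len(segments), measure)            # Step S4: chain detection
    return analyze_chains(chains)                      # Step S5: chain analysis


def divide(stream: str) -> List[str]:
    """Toy divider: each run of identical characters is one segment."""
    return ["".join(group) for _, group in groupby(stream)]


def detect(n: int, measure: Callable[[int, int], float]) -> List[List[int]]:
    """Toy detector: group segments with zero dissimilarity, keep chains of 2 or more."""
    chains: List[List[int]] = []
    for i in range(n):
        for chain in chains:
            if measure(chain[0], i) == 0:
                chain.append(i)
                break
        else:
            chains.append([i])
    return [c for c in chains if len(c) >= 2]


result = run_pipeline(
    "AABBAABB",
    divide=divide,
    extract=lambda seg: seg[0],
    dissimilarity=lambda a, b: 0 if a == b else 1,
    detect=detect,
    analyze_chains=lambda chains: chains,
)
```

Passing the stages as parameters mirrors the statement that the unit does not
depend on any particular division method or characteristic quantity.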

[0044]Through this series of processings, the video voice processing unit 10 can
detect video structure from video data. Using this result, a user can therefore
index and summarize the contents of the video data, or quickly access points of
interest in the video data.
[0045]Hereafter, the processing in the video voice processing unit 10 shown in the
figure is explained in detail, process by process.

[0046]First, the video division in Step S1 is explained. The video voice processing
unit 10 divides the video data inputted into the video dividing part 11 into either
video segments or sound segments or, if possible, into both. Many techniques exist
for automatically detecting the boundaries of segments in video data, and, as
mentioned above, the video voice processing unit 10 imposes no special prerequisite
on the video division method. On the other hand, in the video voice processing unit
10, the accuracy of the chain detection in the later process depends intrinsically
on the accuracy of the underlying video division.

[0047]Next, the characteristic quantity extraction in step S2 is explained. A
characteristic quantity is an attribute of a segment that expresses the feature of the
segment and also supplies the data for measuring the similarity between different
segments. The video voice processing unit 10 calculates the characteristic quantity of
each segment with the image feature quantity extracting part 13 or the voice feature
amount extraction part 14, and thereby expresses the feature of the segment. Although
the video voice processing unit 10 does not depend on the concrete details of any
particular characteristic quantity, the characteristic quantities considered effective
for use in it include, for example, the image characteristic quantity, the voice
feature amount, and the video voice common characteristic quantity described below.
The necessary condition for a characteristic quantity to be applicable in the video
voice processing unit 10 is that its dissimilarity can be measured. To increase the
efficiency of the video voice processing unit 10, such a characteristic quantity also
needs to allow characteristic quantity extraction to be performed simultaneously with
the video division mentioned above. The characteristic quantities explained below
fulfill these necessary conditions.

[0048]The first kind of characteristic quantity concerns images; below, it is called
the image characteristic quantity. Since a video segment consists of consecutive video
frames, extracting a suitable video frame from the video segment makes it possible to
represent the depicted contents of the segment by the extracted frame. That is, the
similarity of video segments can be substituted by the similarity of appropriately
extracted video frames. For this reason, the image characteristic quantity is one of
the important characteristic quantities usable in the video voice processing unit 10.
Taken alone, the image characteristic quantity can express only static information,
but by applying a method described later the video voice processing unit 10 can also
extract the dynamic features of video segments based on it.

[0049]In the video voice processing unit 10, the color in an image is an important
clue when judging whether two images are similar. Judging the similarity of images
using a color histogram is well known, as described for example in "G. Ahanger and
T.D.C. Little, A survey of technologies for parsing and indexing digital video, J. of
Visual Communication and Image Representation 7:28-4, 1996." Here, a color histogram
is obtained by dividing a three-dimensional color space such as HSV or RGB into n
regions and calculating, for each region, the relative frequency with which the pixels
of the image fall into it. The information obtained gives an n-dimensional vector. For
compressed video data, a color histogram can be extracted directly from the compressed
data, as described for example in U.S. Patent No. 5,708,767.

[0050]The video voice processing unit 10 uses a $2^{2 \cdot 3} = 64$-dimensional
histogram vector, obtained by sampling the YUV color space of an image constituting a
segment at 2 bits per color channel.
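As an illustration of the 64-dimensional histogram described in [0050], the following is a minimal sketch rather than the method of the specification itself. It assumes 8-bit Y, U and V values and keeps the top 2 bits of each channel; the function name `yuv_histogram_64` and the bin ordering are hypothetical choices made for this example.

```python
def yuv_histogram_64(pixels):
    """Return a 64-dimensional normalized histogram for a list of (Y, U, V) pixels.

    Assumption: each channel is an 8-bit value (0-255). Keeping the top
    2 bits per channel gives 4 * 4 * 4 = 2**(2*3) = 64 bins.
    """
    hist = [0.0] * 64
    for y, u, v in pixels:
        # Quantize each channel to 2 bits and pack the three values into one index.
        index = ((y >> 6) << 4) | ((u >> 6) << 2) | (v >> 6)
        hist[index] += 1.0
    total = len(pixels)
    if total == 0:
        return hist
    # Relative frequency of pixels per bin.
    return [count / total for count in hist]
```

Normalizing by the pixel count makes histograms of images with different sizes comparable, which matches the "relative ratio" wording used for color histograms above.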
[0051]Although such a histogram expresses the overall color tone of an image, it
contains no time information. The video voice processing unit 10 therefore calculates
image correlation as another image characteristic quantity. In the chain detection in
the video voice processing unit 10, a structure in which two or more similar segments
are interleaved with one another is a strong indicator that they form one coherent
chain structure. For example, in a conversation scene the camera position alternates
between two speakers, but the camera usually returns to almost the same position when
the same speaker is shot again. Correlation based on reduced grayscale images has been
found to be a good index of segment similarity for detecting the structure in such a
case, so the video voice processing unit 10 thins out and reduces the original image
to a grayscale image of size M x N and calculates the image correlation using it.
Small values are sufficient for both M and N, for example 8 x 8. That is, these
reduced grayscale images are interpreted as feature vectors of MN dimensions.
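The reduction to an M x N grayscale image can be sketched as follows. This is only an illustrative assumption: the text speaks of thinning out the image, whereas this sketch averages each block, and the function name `reduce_grayscale` is hypothetical. The input is assumed to be a list of rows of gray values with at least M rows and N columns.

```python
def reduce_grayscale(image, m=8, n=8):
    """Shrink a grayscale image (list of rows) to m x n by block averaging.

    Returns the reduced image flattened into an m*n-dimensional feature vector.
    Assumption: the image has at least m rows and n columns.
    """
    height, width = len(image), len(image[0])
    vector = []
    for r in range(m):
        # Row range of the source block that maps to reduced row r.
        r0, r1 = r * height // m, (r + 1) * height // m
        for c in range(n):
            c0, c1 = c * width // n, (c + 1) * width // n
            block = [image[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            vector.append(sum(block) / len(block))
    return vector
```

With the default 8 x 8 size the result is a 64-dimensional vector, which can then be compared between segments with any vector dissimilarity metric.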

[0052]A further characteristic quantity, different from the image characteristic
quantity mentioned above, concerns sound; below, it is called the voice feature
amount. A voice feature amount is a characteristic quantity that can express the
contents of a sound segment, and the video voice processing unit 10 can use frequency
analysis, pitch, level, and the like as voice feature amounts. These voice feature
amounts are known from various documents.

[0053]First, the video voice processing unit 10 can determine the distribution of
frequency information in a single audio frame by performing frequency analysis such as
the Fourier transform. To express the distribution of frequency information over one
sound segment, the video voice processing unit 10 can use, for example, FFT (Fast
Fourier Transform) components, a frequency histogram, a power spectrum, and other
characteristic quantities.
[0054]The video voice processing unit 10 can also use pitch, such as the average pitch
and the maximum pitch, and sound level, such as the average loudness and the maximum
loudness, as effective voice feature amounts representing a sound segment.
[0055]Cepstrum characteristic quantities include cepstrum coefficients and their
first-order and second-order derivative coefficients, and the video voice processing
unit 10 can also use cepstrum spectrum coefficients obtained from an FFT spectrum or
from LPC (Linear Predictive Coding).

[0056]A further kind of characteristic quantity is the video voice common
characteristic quantity. Although this is neither an image characteristic quantity nor
a voice feature amount, it gives the video voice processing unit 10 useful information
for expressing the feature of a segment in a chain. The video voice processing unit 10
uses activity as this video voice common characteristic quantity.

[0057]Activity is an index of how dynamic or static the contents of a segment are
perceived to be. For example, in the visually dynamic case, activity expresses how
quickly the camera moves along a subject or how quickly the object being shot changes.
[0058]This activity is calculated indirectly by measuring the average inter-frame
dissimilarity of a characteristic quantity such as a color histogram. If the
dissimilarity metric for the characteristic quantity F measured between frame i and
frame j is defined as $d_F(i, j)$, the image activity $V_F$ is defined by the
following formula (1).

[0059]
[Equation 1]
$$V_F = \frac{\sum_{i=b}^{f-1} d_F(i, i+1)}{f - b} \quad \cdots (1)$$

[0060]In formula (1), b and f are the frame numbers of the first and last frames in
one segment, respectively. Specifically, the video voice processing unit 10 can
calculate the image activity $V_F$ using, for example, the histogram mentioned above.
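The activity calculation, that is, the average dissimilarity between consecutive frames from the first frame b to the last frame f of a segment, might be sketched as below. The use of the L1 distance as the default frame dissimilarity and the names `l1_distance` and `image_activity` are assumptions made for this example only.

```python
def l1_distance(a, b):
    """L1 distance between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def image_activity(frame_features, dissimilarity=l1_distance):
    """Average dissimilarity between consecutive frames of one segment.

    frame_features[0] corresponds to frame b and frame_features[-1] to frame f,
    so there are f - b consecutive pairs. A one-frame segment is given
    activity 0.0 (an assumption; the text does not cover this case).
    """
    pairs = len(frame_features) - 1
    if pairs < 1:
        return 0.0
    total = sum(dissimilarity(frame_features[i], frame_features[i + 1])
                for i in range(pairs))
    return total / pairs
```

For example, three frames with feature vectors `[0, 0]`, `[1, 0]` and `[1, 2]` have consecutive distances 1 and 2, so the activity is 1.5.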
[0061]As mentioned above, the characteristic quantities described so far, including
the image characteristic quantity, basically express static information about a
segment; to express the feature of a segment accurately, however, dynamic information
must also be taken into consideration. The video voice processing unit 10 therefore
expresses dynamic information by sampling characteristic quantities as described
below.

[0062]As shown for example in drawing 6, the video voice processing unit 10 extracts
one or more static characteristic quantities at different times within one segment.
The number of characteristic quantities to extract is determined by balancing the
maximization of fidelity against the minimization of data redundancy in the segment
representation. For example, when a certain image in a segment can be designated as
the key frame of that segment, the histogram calculated from the key frame is the
sampled characteristic quantity to be extracted.

[0063]Consider the case where a sample is always chosen at a predetermined time, for
example at the last moment of a segment. In this case, for any two segments that fade
to a black frame, the samples are the same black frame, so the same characteristic
quantity may be obtained. That is, the two selected frames will be judged extremely
similar whatever the image contents of these segments are. This problem arises because
the sample is not a good representative value.

[0064]The video voice processing unit 10 therefore does not extract the characteristic
quantity at such a fixed point, but extracts statistically representative values over
the whole segment. Sampling of characteristic quantities in general is explained here
for two cases: (1) the characteristic quantity can be expressed as an n-dimensional
vector of real numbers, and (2) only a dissimilarity metric is available. The
best-known image characteristic quantities and voice feature amounts, such as the
histogram and the power spectrum, fall under (1).

[0065]In case (1), the number of samples is fixed in advance as k, and the video voice
processing unit 10 automatically divides the characteristic quantities of the whole
segment into k different groups using the well-known k-means clustering method
described in "L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction
to Cluster Analysis, John-Wiley and sons, 1990." The video voice processing unit 10
then chooses, as the sampled value for each of the k groups, the centroid of the group
or a sample near the centroid. The complexity of this processing in the video voice
processing unit 10 increases only linearly with the number of samples.

[0066]In case (2), on the other hand, the video voice processing unit 10 forms k
groups using the k-medoids algorithm described in "L. Kaufman and P.J. Rousseeuw,
Finding Groups in Data: An Introduction to Cluster Analysis, John-Wiley and sons,
1990," and uses the medoid of each of the k groups as the sampled value.
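For case (2), where only a dissimilarity metric is available, sampling by medoids can be sketched as follows. This is a simple alternating assign-and-update variant written for illustration, not necessarily the exact procedure of the cited reference; the function name `k_medoids`, the deterministic choice of the first k items as initial medoids, and the iteration limit are all assumptions.

```python
def k_medoids(items, k, dissimilarity, max_iter=100):
    """Return k medoid items chosen using only a pairwise dissimilarity function.

    Assumption: len(items) >= k, and the first k items seed the medoids.
    """
    medoids = list(range(k))
    for _ in range(max_iter):
        # Assignment step: attach every item to its nearest medoid.
        clusters = [[] for _ in range(k)]
        for idx in range(len(items)):
            nearest = min(range(k),
                          key=lambda c: dissimilarity(items[idx], items[medoids[c]]))
            clusters[nearest].append(idx)
        # Update step: the new medoid minimizes total dissimilarity in its group.
        new_medoids = []
        for c, members in enumerate(clusters):
            if not members:
                new_medoids.append(medoids[c])
                continue
            best = min(members,
                       key=lambda cand: sum(dissimilarity(items[cand], items[o])
                                            for o in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return [items[m] for m in medoids]
```

Because a medoid is always one of the original items, this works even when the characteristic quantities cannot be averaged, which is the situation case (2) describes.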

[0067]In the video voice processing unit 10, the dissimilarity metric for
characteristic quantities expressing the extracted dynamic features is based on the
dissimilarity metric of the underlying static characteristic quantity; how it is
constructed is described later.

[0068]In this way, the video voice processing unit 10 can express dynamic features by
extracting two or more static characteristic quantities and using these plural static
characteristic quantities.

[0069]As described above, the video voice processing unit 10 can extract various
characteristic quantities. In general, each of these characteristic quantities is
often insufficient by itself to express the feature of a segment. The video voice
processing unit 10 can therefore combine these characteristic quantities and select a
set of characteristic quantities that complement one another. For example, by
combining the color histogram and the image correlation mentioned above, the video
voice processing unit 10 can obtain more information than either characteristic
quantity has alone.

[0070]Next, the similarity measurement of segments using characteristic quantities in
step S3 in drawing 5 is explained. The video voice processing unit 10 measures the
similarity of segments in the characteristic quantity similarity test section 17 using
a dissimilarity metric, a function that calculates a real value measuring how
dissimilar two characteristic quantities are. A small value of the dissimilarity
metric indicates that the two characteristic quantities are similar, and a large value
indicates that they are dissimilar. Here, the function that calculates the
dissimilarity of two segments $S_1$ and $S_2$ with respect to the characteristic
quantity F is defined as the dissimilarity metric $d_F(S_1, S_2)$. Such a function
satisfies the relations given by the following formulas (2).
[0071] 
[Equation 2] 




[0072]Some dissimilarity metrics are applicable only to certain specific
characteristic quantities. In general, however, as described in "G. Ahanger and T.D.C.
Little, A survey of technologies for parsing and indexing digital video, J. of Visual
Communication and Image Representation, 7:28-4, 1996" and "L. Kaufman and P.J.
Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John-Wiley and
sons, 1990," many dissimilarity metrics are applicable to measuring the similarity of
characteristic quantities expressed as points in an n-dimensional space. Examples are
the Euclidean distance, the inner product, and the L1 distance. Since the L1 distance
in particular works effectively for various characteristic quantities, including the
histogram and the image correlation, the video voice processing unit 10 adopts the L1
distance. When two n-dimensional vectors are denoted A and B, the L1 distance
$d_{L1}(A, B)$ between A and B is given by the following formula (3).

[0073]
[Equation 3]
$$d_{L1}(A, B) = \sum_{i=1}^{n} |A_i - B_i| \quad \cdots (3)$$

[0074]Here, the subscript i denotes the i-th element of each of the n-dimensional
vectors A and B.

[0075]As mentioned above, the video voice processing unit 10 extracts static
characteristic quantities at various times in a segment as the characteristic quantity
expressing dynamic features. To determine the similarity between two extracted dynamic
characteristic quantities, the video voice processing unit 10 uses, as their
dissimilarity metric, the dissimilarity metric between the underlying static
characteristic quantities. In many cases it is best to determine the dissimilarity
metric of the dynamic characteristic quantities from the dissimilarity value of the
most similar pair of static characteristic quantities chosen from each dynamic
characteristic quantity. In this case, the dissimilarity metric between two extracted
dynamic characteristic quantities $SF_1$ and $SF_2$ is defined by the following
formula (4).

[0076]
[Equation 4]
$$d_S(SF_1, SF_2) = \min_{F_1 \in SF_1,\, F_2 \in SF_2} d_F(F_1, F_2) \quad \cdots (4)$$

[0077]Here, the function $d_F(F_1, F_2)$ in formula (4) denotes the dissimilarity
metric for the underlying static characteristic quantity F. Depending on the case, the
maximum or the average of the dissimilarity of the characteristic quantities may be
taken instead of the minimum.
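The L1 distance and the dissimilarity between two dynamic characteristic quantities (each a set of static feature vectors sampled from one segment) can be sketched as follows. The function names and the `reduce` parameter are hypothetical; passing `max` instead of `min` corresponds to the alternative the text mentions.

```python
def l1_distance(a, b):
    """L1 distance: sum of absolute element-wise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dynamic_dissimilarity(sf1, sf2, static_dissimilarity=l1_distance, reduce=min):
    """Dissimilarity between two sets of static feature vectors.

    By default this takes the dissimilarity of the most similar pair,
    one vector drawn from each set.
    """
    return reduce(static_dissimilarity(f1, f2) for f1 in sf1 for f2 in sf2)
```

Taking the minimum means two segments count as similar as soon as any one sampled moment of the first resembles any one sampled moment of the second.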

[0078]When the video voice processing unit 10 determines the similarity of segments, a
single characteristic quantity is often insufficient, and information from many
characteristic quantities of the same segment needs to be combined. As one method for
this, the video voice processing unit 10 calculates the dissimilarity based on the
various characteristic quantities as a weighted combination of the individual
characteristic quantities. That is, when k characteristic quantities
$F_1, F_2, \ldots, F_k$ exist, the video voice processing unit 10 uses the
dissimilarity metric $d_F(S_1, S_2)$ for the combined characteristic quantity
expressed by the following formula (5).
[0079]
[Equation 5]
$$d_F(S_1, S_2) = \sum_{i=1}^{k} w_i \, d_{F_i}(S_1, S_2) \quad \cdots (5)$$

[0080]Here, $\{w_i\}$ are weighting factors satisfying $\sum_i w_i = 1$.
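A weighted combination of per-feature dissimilarities, with weights that sum to 1, might look like the sketch below. Representing a segment as a dictionary of named feature vectors, and the names `combined_dissimilarity`, `hist` and `corr`, are assumptions for illustration only.

```python
def combined_dissimilarity(seg1, seg2, metrics, weights):
    """Weighted sum of per-feature dissimilarities between two segments.

    metrics: list of functions d(seg1, seg2), one per characteristic quantity.
    weights: list of weighting factors, which must sum to 1.
    """
    if len(metrics) != len(weights):
        raise ValueError("one weight is required per metric")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * d(seg1, seg2) for w, d in zip(weights, metrics))


def l1_on(key):
    """Build an L1 metric that compares the feature stored under `key`."""
    return lambda s1, s2: sum(abs(x - y) for x, y in zip(s1[key], s2[key]))
```

For instance, `combined_dissimilarity(s1, s2, [l1_on("hist"), l1_on("corr")], [0.75, 0.25])` weights a histogram distance three times as heavily as a correlation-image distance.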

[0081]As described above, the video voice processing unit 10 can calculate the
dissimilarity metric using the characteristic quantities extracted in step S2 in
drawing 5, and can measure the similarity between segments.
[0082]Next, the chain detection in step S4 in drawing 5 is explained. The video voice
processing unit 10 detects similar chains, which express the relation between similar
segments, using the dissimilarity metric and the extracted characteristic quantities.
Here, several types of similar chain are first defined, and the algorithm for
detecting each type is then explained concretely.

[0083]The types of similar chain defined below are independent of one another, so in
the video voice processing unit 10 one chain can belong to two or more types. Such a
chain is named by combining the defined type names. For example, a partial uniform
link chain denotes a link similar chain that is both partial and uniform, as described
later.
[0084]The types of similar chain are roughly divided into those with restrictions on
the relation between the similar segments the chain contains and those with
restrictions on the structure of the chain. In the following definitions, a chain C
denotes a series of segments $S_{i_1}, \ldots, S_{i_m}$. Here, the index $i_k$ denotes
the segment number of the segment in the original video data, and the subscript k on i
means that the segment is the k-th on the time axis within the similar chain. The
segments of the series are always in order on the time axis, so that $i_k < i_{k+1}$
for all $k = 1, \ldots, m-1$. $|C|$ denotes the length of the chain, and $C^{start}$
and $C^{end}$ denote the start time and end time of the chain C in the video data,
respectively. More precisely, the start time of the chain C is the start time of the
first segment in the chain C, and the end time of the chain C is the end time of the
last segment in the chain C. Further, when a certain segment is denoted A, segments
similar to it are denoted A', A'', A''', .... Finally, two segments are said to be
similar when their dissimilarity metric is smaller than the dissimilarity threshold
described later, and this is written $\mathrm{similar}(S_i, S_j)$.

[0085]Similar chains with restrictions on the relation between the similar segments
they contain are the basic similar chain, the link similar chain, and the periodic
chain.

[0086]First, a basic similar chain is a chain C in which all segments are mutually
similar, as shown in drawing 7. A basic similar chain has no structural restrictions.
In many cases, a basic similar chain is obtained as the result of a grouping algorithm
or clustering algorithm that groups similar segments.

[0087]A link similar chain, as shown in drawing 8, is a chain C in which adjacent
segments are mutually similar. That is, in a link similar chain,
$\mathrm{similar}(S_{i_k}, S_{i_{k+1}})$ holds for all $k = 1, \ldots, |C|-1$. Using
the notation for similar segments defined above, a link similar chain can be written
as A, A', A'', A''', ....

[0088]A periodic chain, as shown in drawing 9, is a similar chain $C_{cyclic}$ in
which each segment is similar to the segment m positions after it. That is, in a
periodic chain, $\mathrm{similar}(S_{i_k}, S_{i_{k+m}})$ holds for all
$k = 1, \ldots, |C_{cyclic}| - m$. In other words, a periodic chain is an approximate
repetition of a series of m segments. A periodic chain can therefore be written as
$S_1, S_2, \ldots, S_m, S_1', S_2', \ldots, S_m', S_1'', S_2'', \ldots, S_m'', \ldots$.
[0089]On the other hand, similar chains with structural restrictions are the partial
chain and the uniform chain.

[0090]A partial chain is a chain C in which, for each pair of adjacent segments, the
time interval between the segments is smaller than a predetermined time. That is, if
the maximum time interval permitted between two segments in a chain is denoted gap, a
partial chain satisfies $i_{k+1} - i_k \le \mathrm{gap}$ for the adjacent segments
$S_{i_k}$ and $S_{i_{k+1}}$ for all $k = 1, \ldots, |C|-1$.
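The chain types defined so far (basic, link, periodic and partial) can be expressed as simple predicates. In this sketch a chain is assumed to be a list of segment numbers in increasing order, `similar` is an assumed caller-supplied predicate on two segment numbers, and the gap is measured in segment numbers as in the inequality above; all function names are hypothetical.

```python
def is_basic_chain(chain, similar):
    """All segments in the chain are mutually similar."""
    return all(similar(a, b) for i, a in enumerate(chain) for b in chain[i + 1:])


def is_link_chain(chain, similar):
    """Every pair of adjacent segments in the chain is similar."""
    return all(similar(chain[k], chain[k + 1]) for k in range(len(chain) - 1))


def is_periodic_chain(chain, similar, m):
    """Every segment is similar to the segment m positions later in the chain."""
    return all(similar(chain[k], chain[k + m]) for k in range(len(chain) - m))


def is_partial_chain(chain, gap):
    """Adjacent segments are at most `gap` segment numbers apart."""
    return all(chain[k + 1] - chain[k] <= gap for k in range(len(chain) - 1))
```

Because the predicates are independent, a single chain can satisfy several of them at once, which is how combined names such as a partial link chain arise.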

[0091]When the segments of a chain appear at almost equal time intervals, this can be
a strong indicator of important video structure; such a chain C is defined as a
uniform chain. Here, the uniformity uniformity(C) of the chain C is defined, as shown
in the following formula (6), as the average deviation of the time intervals from
equally spaced times, normalized by the length of the chain.

[0092] 

[Equation 6] 




[0093]The uniformity uniformity(C) of the chain C given by formula (6) takes a value
in the range 0 to 1; a small value indicates that the distribution of the time
intervals between segments is close to uniform. When the value of uniformity(C) is
smaller than a predetermined uniformity threshold, the chain C is regarded as a
uniform chain.
[0094]Hereafter, the processing by which the video voice processing unit 10 detects
each of these kinds of chain is explained.

[0095]To detect the basic similar chain described above, the video voice processing
unit 10 uses batch clustering technology or serial clustering technology.

[0096]Batch clustering technology detects chains all at once. To apply it, however,
all video division must be finished before chain detection is performed. Serial
clustering technology, on the other hand, detects chains sequentially; if video
division and characteristic quantity extraction are also performed sequentially, video
analysis can be carried out sequentially while the video data is being played. If the
video voice processing unit 10 has sufficient computing power, this sequential chain
detection can be performed in real time; in other words, chains can be detected while
video data is being captured or recorded. Sequential video analysis, however, may
cause problems of accuracy. A sequential method has no global information for
determining the optimal chain structure and is, moreover, sensitive to the order in
which segments are input, so it may produce results of low quality.

[0097]When using batch clustering technology, the video voice processing unit 10
detects basic similar chains in two processes, as shown in drawing 10.

[0098]First, the video voice processing unit 10 detects candidate chains in step S11.
That is, the video voice processing unit 10 detects similar segments in the video data
and gathers them into clusters. The resulting set of segment clusters serves as the
initial candidates for detecting basic similar chains.

[0099]The video voice processing unit 10 can use any clustering technique to obtain
the initial candidates for similar chains; here, the hierarchical clustering method
described in "L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction
to Cluster Analysis, John-Wiley and sons, 1990" is used. This algorithm starts by
merging the two most similar segments into one pair and then, at each subsequent
stage, merges the most similar pair of clusters using a similarity metric between
clusters. In this algorithm, the dissimilarity metric $d_C(C_1, C_2)$ between two
clusters $C_1$ and $C_2$ is defined as the minimum dissimilarity between two segments
contained in the respective clusters, as shown in the following formula (7).

[0100]
[Equation 7]
$$d_C(C_1, C_2) = \min_{S_1 \in C_1,\, S_2 \in C_2} d_F(S_1, S_2) \quad \cdots (7)$$

[0101]In the video voice processing unit 10, the maximum function or the average
function may be used, if needed, instead of the minimum function in formula (7).

[0102]Without any restriction, this hierarchical clustering method would end up
gathering all the segments contained in the video data into a single group. Therefore,
as shown in drawing 11, the video voice processing unit 10 introduces a dissimilarity
threshold $\delta_{sim}$ and judges whether a segment is similar to another segment by
comparison with this threshold. As shown in the figure, the dissimilarity threshold
$\delta_{sim}$ determines how similar two segments must be to be regarded as belonging
to the same chain. The video voice processing unit 10 merges segments into clusters
only within the range in which the dissimilarity of every merged cluster pair does not
exceed the dissimilarity threshold $\delta_{sim}$.

[0103]The dissimilarity threshold $\delta_{sim}$ may be set by the user, or the video
voice processing unit 10 may determine it automatically. When a fixed value is used as
$\delta_{sim}$, however, its optimum value depends on the contents of the video data.
For example, for video data with widely varied image contents, $\delta_{sim}$ needs to
be set to a high value. For video data with little change in image contents, on the
other hand, $\delta_{sim}$ needs to be set to a low value. In general, the number of
clusters detected decreases when $\delta_{sim}$ is high and increases when
$\delta_{sim}$ is low.
[0104]Since the dissimilarity threshold thus influences performance, determining a
suitable $\delta_{sim}$ is important in the video voice processing unit 10. When the
user sets $\delta_{sim}$, it must therefore be set with the above in mind.
Alternatively, the video voice processing unit 10 can determine an effective
$\delta_{sim}$ automatically by the method described below.

[0105]For example, as one such method, the video voice processing unit 10 can obtain
the dissimilarity threshold $\delta_{sim}$ using statistics such as the mean and the
median of the distribution of dissimilarity over the $n(n-1)/2$ segment pairs. If the
mean and the standard deviation of the dissimilarity over all segment pairs are
denoted $\mu$ and $\sigma$, respectively, the dissimilarity threshold $\delta_{sim}$
can be expressed in the form $a\mu + b\sigma$. Here, a and b are constants, and
setting them to 0.5 and 0.1, respectively, has been found to give good results.

[0106]In practice, the video voice processing unit 10 need not compute the
dissimilarity for all segment pairs; it suffices to select at random, from the set of
all segment pairs, enough pairs for the mean $\mu$ and the standard deviation $\sigma$
to come sufficiently close to their true values, and to compute the dissimilarity for
those pairs. Using the mean $\mu$ and the standard deviation $\sigma$ obtained in this
way, the video voice processing unit 10 can determine a suitable dissimilarity
threshold $\delta_{sim}$ automatically. For example, when the total number of segment
pairs is n and C is an arbitrary small constant, the video voice processing unit 10
can determine a suitable $\delta_{sim}$ automatically by extracting the dissimilarity
of Cn segment pairs.
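The automatic threshold of the form a*mu + b*sigma, optionally estimated from a random sample of segment pairs, could be sketched as follows. The function name `estimate_threshold`, the `sample_size` and `seed` parameters, and the use of the population standard deviation are assumptions; the text does not say which form of standard deviation is meant. At least two segments are assumed.

```python
import random
import statistics


def estimate_threshold(n_segments, dissimilarity, a=0.5, b=0.1,
                       sample_size=None, seed=0):
    """Estimate a dissimilarity threshold as a * mean + b * standard deviation.

    dissimilarity(i, j) takes two segment numbers. If sample_size is given and
    smaller than the number of pairs, only that many randomly chosen pairs
    are evaluated.
    """
    pairs = [(i, j) for i in range(n_segments) for j in range(i + 1, n_segments)]
    if sample_size is not None and sample_size < len(pairs):
        pairs = random.Random(seed).sample(pairs, sample_size)
    values = [dissimilarity(i, j) for i, j in pairs]
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    return a * mu + b * sigma
```

Sampling trades a small loss in accuracy of the mean and standard deviation for far fewer dissimilarity evaluations than the full n(n-1)/2 pairs.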

[0107]After clustering the segments as described so far, the video voice processing
unit 10 can obtain the initial candidates for basic similar chains by rearranging the
segments contained in each cluster within that cluster.
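Candidate detection in step S11 can be sketched as threshold-limited agglomerative clustering, using the minimum pairwise dissimilarity between clusters. This is a minimal, unoptimized illustration; the function name `cluster_segments`, the stopping rule (stop when the closest cluster pair exceeds the threshold), and ordering each cluster by segment number are assumptions consistent with, but not stated verbatim in, the description.

```python
def cluster_segments(n_segments, dissimilarity, threshold):
    """Group segment numbers 0..n_segments-1 into candidate chains.

    Repeatedly merges the two closest clusters (minimum pairwise
    dissimilarity) until the closest pair exceeds the threshold.
    """
    clusters = [[i] for i in range(n_segments)]

    def cluster_distance(c1, c2):
        return min(dissimilarity(a, b) for a in c1 for b in c2)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        # Merge and keep the segments of the cluster in time order.
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters
```

Each returned list is one initial chain candidate, already sorted by segment number so that it follows the time axis.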

[0108]Many of the chain candidates detected in step S11 in drawing 10 are, in fact,
unrelated to the video structure. The video voice processing unit 10 therefore needs
to determine which chain candidates are important chains that form the skeleton of the
video structure or are relevant to the video structure. To do so, the video voice
processing unit 10 performs chain filtering in step S12 using quality metrics,
numerical criteria that indicate the quality of a chain. That is, the video voice
processing unit 10 measures the importance and relevance of each chain candidate for
video structure analysis, and outputs as the result of chain detection only those
chain candidates that exceed a predetermined quality metric threshold. The simplest
example of a relevance measurement function used in filtering is a Boolean function
that indicates whether a chain candidate is accepted, but the video voice processing
unit 10 may use a more complicated relevance measurement function if needed.

[0109]The video voice processing unit 10 uses chain length, chain density, chain
strength, and the like as chain quality metrics.

[0110]First, chain length is defined as the number of segments a chain contains. The
video voice processing unit 10 can use chain length as a chain quality metric because
a chain of short length can generally be regarded as noise. For example, a chain that
has only a single segment carries no information. A quality metric based on chain
length therefore imposes, as a restriction, a minimum on the number of segments a
chain must contain.

[0111]Next, chain density is defined as the ratio of the total number of segments a
chain contains to the total number of segments in the sub-region of the video data
that the chain occupies. This is because it may be more desirable for a chain to be
concentrated in a limited span of time. In such cases, the video voice processing unit
10 can use chain density as a chain quality metric.
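Chain length, chain density, and threshold-based filtering of candidates can be sketched as below. A chain is assumed to be a non-empty list of segment numbers in increasing order, so the sub-region it occupies runs from its first to its last segment number; the function names are hypothetical.

```python
def chain_length(chain):
    """Number of segments the chain contains."""
    return len(chain)


def chain_density(chain):
    """Segments in the chain divided by all segments in the span it occupies."""
    span = chain[-1] - chain[0] + 1
    return len(chain) / span


def filter_chains(candidates, quality, threshold):
    """Keep only the candidates whose quality metric exceeds the threshold."""
    return [chain for chain in candidates if quality(chain) > threshold]
```

For example, `filter_chains(candidates, chain_length, 1)` drops single-segment chains, which carry no structural information.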

[0112]Finally, chain strength is an index of how similar the segments in a chain are
to one another; the more similar the segments are, the higher the strength of the
chain is considered to be. Many methods of measuring chain strength are available to
the video voice processing unit 10, including the intra-chain similarity measuring
method described below, a method that takes the average dissimilarity over all
possible segment pairs, and a method that takes the maximum dissimilarity over all
possible segment pairs.

[0113]As an example, the case where the video voice processing unit 10 measures chain
strength with the intra-chain similarity measuring method is described. Here, the
intra-chain similarity measuring method expresses the similarity of the segments that
constitute a chain as the average dissimilarity between each segment and the most
representative segment that the chain contains. One example of a representative
segment is the center-of-gravity (centroid) segment of the chain. If the
center-of-gravity segment of chain C is denoted S_centroid, this center-of-gravity
segment S_centroid is defined by the following formula (8).



[0114]
[Equation 8]

[0115]Here, argmin in formula (8) above means choosing the input that minimizes the
value of the expression being evaluated.

[0116]From this, when chain strength is denoted d_centroid, the chain strength
d_centroid is expressed by the following formula (9).
[0117]
[Equation 9]
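
The centroid-based strength measure described in the preceding paragraphs can be
sketched as follows. This is only one reading of the prose: the centroid segment is
taken to be the member of the chain that minimizes the summed dissimilarity to all
members, and the strength value is the average dissimilarity of the members to that
centroid. The feature representation (a tuple of floats) and the L1 dissimilarity are
assumptions made for the example, not something the specification fixes here.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def l1(a: Vector, b: Vector) -> float:
    """Hypothetical segment dissimilarity: L1 distance between feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def centroid_segment(chain: Sequence[Vector],
                     d: Callable[[Vector, Vector], float] = l1) -> Vector:
    """Member of the chain with the smallest total dissimilarity to all members."""
    return min(chain, key=lambda s: sum(d(s, t) for t in chain))


def chain_strength(chain: Sequence[Vector],
                   d: Callable[[Vector, Vector], float] = l1) -> float:
    """Average dissimilarity between each segment and the centroid segment."""
    c = centroid_segment(chain, d)
    return sum(d(s, c) for s in chain) / len(chain)


if __name__ == "__main__":
    chain = [(0.0,), (1.0,), (5.0,)]
    print(centroid_segment(chain))  # (1.0,)
    print(chain_strength(chain))    # (1 + 0 + 4) / 3
```

Because the value is built from dissimilarities, a smaller result corresponds to a more
tightly similar chain; a filter that keeps chains whose quality exceeds a threshold would
need to invert or negate it, which is a design choice of this sketch.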



[0118]Now, the video voice processing unit 10 performs chain filtering using the chain
quality metrics described above, through the series of processing steps shown
concretely in drawing 12.

[0119]First, in Step S21, the video voice processing unit 10 initializes chain list
C_list with the candidate chains and sets filtered chain list C_filtered to an empty
state.
[0120]Then, in Step S22, the video voice processing unit 10 determines whether chain
list C_list is empty.

[0121]Here, when chain list C_list is empty, no candidate chain remains to be
processed, so the video voice processing unit 10 ends the series of processing steps.

[0122]On the other hand, when chain list C_list is not empty, in Step S23 the video
voice processing unit 10 takes the first element of chain list C_list as chain C and
removes chain C from chain list C_list.

[0123]Then, in Step S24, the video voice processing unit 10 calculates the chain
quality metric for chain C.

[0124]In Step S25, the video voice processing unit 10 determines whether this chain
quality metric is larger than the quality metric threshold.
[0125]Here, when the chain quality metric is smaller than the quality metric
threshold, the video voice processing unit 10 returns to Step S22 and processes
another chain.

[0126]On the other hand, when the chain quality metric is larger than the quality
metric threshold, in Step S26 the video voice processing unit 10 adds chain C to
filtered chain list C_filtered.

[0127]Then, in Step S27, the video voice processing unit 10 determines whether chain
list C_list is empty.

[0128]Here, when chain list C_list is empty, no candidate chain remains to be
processed, so the video voice processing unit 10 ends the series of processing steps.
[0129]On the other hand, when chain list C_list is not empty, the video voice
processing unit 10 returns to Step S23. Thus, the video voice processing unit 10
repeats the processing until chain list C_list becomes empty.
[0130]By such a series of processing steps, the video voice processing unit 10 can
perform chain filtering and can determine which chains are the important chains that
form the skeleton of the video structure, that is, the chains relevant to the video
structure.
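
The filtering loop of Steps S21 to S27 can be sketched as follows. The step labels in
the comments refer to the flow described above; the function name, the use of a deque,
and the idea of passing the quality metric in as a callable are illustrative choices,
not part of the specification.

```python
from collections import deque
from typing import Callable, Iterable, List, Sequence, TypeVar

Segment = TypeVar("Segment")
Chain = Sequence[Segment]


def filter_chains(candidates: Iterable[Chain],
                  quality: Callable[[Chain], float],
                  threshold: float) -> List[Chain]:
    """Keep only the chains whose quality metric is larger than the threshold."""
    c_list = deque(candidates)        # S21: chain list holds the candidate chains
    c_filtered: List[Chain] = []      # S21: filtered chain list starts empty
    while c_list:                     # S22 / S27: stop when the chain list is empty
        chain = c_list.popleft()      # S23: take and remove the first chain
        q = quality(chain)            # S24: compute the chain quality metric
        if q > threshold:             # S25: compare with the quality threshold
            c_filtered.append(chain)  # S26: keep the chain
    return c_filtered


if __name__ == "__main__":
    # Chain length used as the quality metric; single-segment chains are dropped.
    chains = [[1], [1, 2, 3], [4, 5]]
    print(filter_chains(chains, len, 1))
```

Any of the metrics described earlier (length, density, strength) can be supplied as the
quality argument.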
[0131]As described above, the video voice processing unit 10 can detect basic similar
chains using such batch clustering technology.

[0132]As an alternative to batch clustering technology, the video voice processing
unit 10 can also detect basic similar chains using the sequential clustering
technology mentioned above. That is, the video voice processing unit 10 processes the
segments in the video data one at a time in input order and repeatedly updates a
chain candidate list. In this case too, as with batch clustering technology, the video
voice processing unit 10 divides the main process of chain detection into two steps.
First, the video voice processing unit 10 detects clusters of similar segments using a
sequential clustering algorithm. Next, the video voice processing unit 10 filters the
detected clusters using the same chain quality metrics as in batch clustering
technology. The filtering differs from the batch case in that, when sequential
clustering technology is used, chain filtering is carried out at an early stage.
[0133]In sequential clustering technology, a sequential clustering algorithm is used
to cluster the segments. In general, almost all sequential clustering reaches only a
local optimum. That is, whenever a new segment is input, the clustering algorithm
decides locally whether to assign the segment to an existing cluster or to generate a
new cluster containing only that segment. On the other hand, some more elaborate
sequential clustering algorithms update the cluster partition itself whenever a new
segment is input, in order to prevent the bias caused by the input order of the
segments. For such an algorithm, see J. Roure and L. Talavera, "Robust incremental
clustering with bad instance orderings: a new strategy," in Proceedings of the Sixth
Iberoamerican Conference on Artificial Intelligence, IBERAMIA-98, pages 136-147,
Lisbon, Portugal, Helder Coelho ed., LNAI vol. 1484, Springer Verlag, 1998.
[0134]As an example of a sequential clustering algorithm, the video voice processing
unit 10 performs the processing shown in drawing 13. Here, the video data divided into
segments is assumed to have segments S_1, ..., S_n. A series of processing steps that
also includes the chain analysis process is explained here.
[0135]First, as shown in the figure, the video voice processing unit 10 initializes
chain list C_list to an empty state in Step S31 and sets segment number i to 1 in
Step S32.
[0136]Next, in Step S33, the video voice processing unit 10 determines whether segment
number i is smaller than the total number of segments n.

[0137]Here, when segment number i is larger than the total number of segments n, no
segment remains to be processed, so the video voice processing unit 10 ends the series
of processing steps.

[0138]On the other hand, when segment number i is smaller than the total number of
segments n, the video voice processing unit 10 takes in segment S_i, that is, the
segment currently being processed, in Step S34, and determines whether chain list
C_list is empty in Step S35.



[0139]Here, when chain list C_list is empty, the video voice processing unit 10
proceeds to Step S42.

[0140]On the other hand, when chain list C_list is not empty, in Step S36 the video
voice processing unit 10 finds chain C_min whose dissimilarity to segment S_i is the
minimum. Here, chain C_min is defined by the following formula (10).
[0141]

[Equation 10]

[0142]In formula (10) above, d_SC(C, S) denotes the dissimilarity metric between chain
C and segment S, and is given by the following formula (11).
[0143]

[Equation 11]

[0144]This is equivalent to formula (7) above, the similarity metric defined for batch
clustering technology, with its second argument set to a cluster containing only the
segment in question. Below, the minimum dissimilarity d_SC(C_min, S_i) between chain
C_min and segment S_i is written simply as d_min.
[0145]Next, in Step S37, the video voice processing unit 10 determines whether minimum
dissimilarity d_min is smaller than dissimilarity threshold delta_sim, using the
dissimilarity threshold delta_sim explained for batch clustering technology.

[0146]Here, when minimum dissimilarity d_min is larger than dissimilarity threshold
delta_sim, the video voice processing unit 10 proceeds to Step S42 and generates new
chain C_new having segment S_i as its only element segment. In Step S43, it adds new
chain C_new to chain list C_list and proceeds to Step S39.

[0147]On the other hand, when minimum dissimilarity d_min is smaller than
dissimilarity threshold delta_sim, in Step S38 the video voice processing unit 10 adds
segment S_i to chain C_min. That is, the video voice processing unit 10 sets
C_min <- C_min ∪ S_i.

[0148]Then, in Step S39, the video voice processing unit 10 filters the chains. That
is, as described above, for each element chain C ∈ C_list, the video voice processing
unit 10 measures the quality of chain C, selects only the chains whose quality metric
exceeds the quality metric threshold, and adds them to chain list C_filtered.
[0149]In Step S40, the video voice processing unit 10 analyzes the chains
sequentially. That is, the video voice processing unit 10 passes the filtered chain
list C_filtered at that point in time to an analysis module.

[0150]Then, in Step S41, the video voice processing unit 10 adds 1 to segment number i
and returns to Step S33.

[0151]Thus, the video voice processing unit 10 repeats the above series of processing
steps until segment number i becomes larger than the total number of segments n, and
each element chain of chain list C_list at the time segment number i becomes larger
than the total number of segments n is detected as a basic similar chain.
[0152]The series of processing steps shown in the figure assumes that the total number
of segments n contained in the input video data is known. In many cases, however, the
total number of segments n is not given in advance. In that case, the sequential
clustering algorithm need only decide in Step S33 of the figure whether to continue or
end processing according to whether a further segment is input.

[0153]By such a series of processing steps, the video voice processing unit 10 can
detect basic similar chains using sequential clustering technology.
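
The sequential detection of basic similar chains (Steps S31 to S43, without the
filtering and analysis of Steps S39 and S40) might be sketched as below. The
chain-to-segment dissimilarity d_sc is passed in as a parameter because formula (11) is
not reproduced in this text; the example metric in the usage section (smallest absolute
difference to any member) is purely hypothetical. Iterating over an input stream rather
than over a known count n corresponds to the case where the total number of segments is
not given in advance.

```python
from typing import Callable, Iterable, List

Chain = List[float]


def sequential_chains(segments: Iterable[float],
                      d_sc: Callable[[Chain, float], float],
                      delta_sim: float) -> List[Chain]:
    """Assign each incoming segment to its nearest chain or start a new chain."""
    c_list: List[Chain] = []                      # S31: empty chain list
    for seg in segments:                          # S33/S34: next input segment
        if c_list:                                # S35: is the chain list empty?
            # S36: chain with minimum dissimilarity to the segment
            c_min = min(c_list, key=lambda c: d_sc(c, seg))
            d_min = d_sc(c_min, seg)
            if d_min < delta_sim:                 # S37: below the threshold?
                c_min.append(seg)                 # S38: add segment to C_min
                continue
        # S42/S43: new chain holding only this segment. The text does not say
        # what happens when d_min equals the threshold; this sketch treats
        # equality like "larger".
        c_list.append([seg])
    return c_list


if __name__ == "__main__":
    # Hypothetical one-dimensional features and a hypothetical d_sc.
    def nearest_member(chain: Chain, seg: float) -> float:
        return min(abs(seg - s) for s in chain)

    stream = [0.0, 10.0, 0.5, 10.2, 0.1]
    print(sequential_chains(stream, nearest_member, 1.0))
```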
[0154]Next, the processing that detects the link similar chains mentioned above is
explained. Detection of link similar chains in the video voice processing unit 10 can
be regarded as a special case of basic similar chain detection. As a link similar
chain detecting method using a sequential clustering algorithm, the video voice
processing unit 10 performs the processing shown in drawing 14. Here, the video data
divided into segments is assumed to have segments S_1, ..., S_n. A series of
processing steps that also includes the chain analysis process is explained here.
[0155]As shown in the figure, the video voice processing unit 10 initializes chain
list C_list to an empty state in Step S51 and sets segment number i to 1 in Step S52.
[0156]Next, in Step S53, the video voice processing unit 10 determines whether segment
number i is smaller than the total number of segments n.

[0157]Here, when segment number i is larger than the total number of segments n, no
segment remains to be processed, so the video voice processing unit 10 ends the series
of processing steps.

[0158]On the other hand, when segment number i is smaller than the total number of
segments n, the video voice processing unit 10 takes in segment S_i, that is, the
segment currently being processed, in Step S54, and in Step S55 finds chain C_min
whose dissimilarity to segment S_i is the minimum. Here, chain C_min is defined by the
following formula (12).
[0159] 

[Equation 12] 





[0160]In formula (12) above, d_SC(C, S) again denotes the dissimilarity metric between
chain C and segment S; in link similar chain detection, however, this dissimilarity
metric d_SC(C, S) is given by the following formula (13).
[0161] 

[Equation 13] 



[0162]That is, unlike the dissimilarity metric of formula (11) above, which was used
for detecting basic similar chains, dissimilarity metric d_SC(C, S) is given as the
dissimilarity between the segment in question and the last element segment of
chain C.

[0163]Next, in Step S56, the video voice processing unit 10 determines whether minimum
dissimilarity d_min is smaller than dissimilarity threshold delta_sim, using the
dissimilarity threshold delta_sim mentioned above.
[0164]Here, when minimum dissimilarity d_min is larger than dissimilarity threshold
delta_sim, the video voice processing unit 10 proceeds to Step S61 and generates new
chain C_new having segment S_i as its only element segment. In Step S62, it adds new
chain C_new to chain list C_list and proceeds to Step S58.

[0165]On the other hand, when minimum dissimilarity d_min is smaller than
dissimilarity threshold delta_sim, in Step S57 the video voice processing unit 10
appends segment S_i to the end of chain C_min. That is, the video voice processing
unit 10 sets C_min <- C_min, S_i.

[0166]Then, in Step S58, the video voice processing unit 10 filters the chains. That
is, as described above, for each element chain C ∈ C_list, the video voice processing
unit 10 measures the quality of chain C, selects only the chains whose quality metric
exceeds the quality metric threshold, and adds them to chain list C_filtered. The
video voice processing unit 10 can also skip this step.

[0167]In Step S59, the video voice processing unit 10 analyzes the chains
sequentially. That is, the video voice processing unit 10 passes the filtered chain
list C_filtered at that point in time to an analysis module.

[0168]Then, in Step S60, the video voice processing unit 10 adds 1 to segment number i
and returns to Step S53.

[0169]Thus, the video voice processing unit 10 repeats the above series of processing
steps until segment number i becomes larger than the total number of segments n, and
each element chain of chain list C_list at the time segment number i becomes larger
than the total number of segments n is detected as a link similar chain.
[0170]By such a series of processing steps, the video voice processing unit 10 can
detect link similar chains using sequential clustering technology.
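
The link similar chain variant differs from basic similar chain detection only in the
chain-to-segment dissimilarity: the new segment is compared with the last element of
each chain. A minimal sketch under that reading follows; the segment dissimilarity d is
a parameter, and the absolute difference used in the usage example is a hypothetical
stand-in. Filtering and analysis (Steps S58 and S59) are omitted, and an empty chain
list is handled by starting a new chain, which is an assumption of this sketch.

```python
from typing import Callable, Iterable, List, Optional

Chain = List[float]


def link_chains(segments: Iterable[float],
                d: Callable[[float, float], float],
                delta_sim: float) -> List[Chain]:
    """Sequentially build chains by comparing each segment with chain tails."""
    c_list: List[Chain] = []                       # S51: empty chain list
    for seg in segments:                           # S53/S54: next input segment
        c_min: Optional[Chain] = None
        d_min = float("inf")
        for chain in c_list:                       # S55: d_SC(C, S) = d(last of C, S)
            dist = d(chain[-1], seg)
            if dist < d_min:
                c_min, d_min = chain, dist
        if c_min is not None and d_min < delta_sim:  # S56: threshold test
            c_min.append(seg)                      # S57: append to the end of C_min
        else:
            c_list.append([seg])                   # S61/S62: start a new chain
    return c_list


if __name__ == "__main__":
    # Each step is close to the previous one, so the chain can drift:
    # 2.4 ends up in the same chain as 0.0 although they are far apart.
    stream = [0.0, 0.8, 1.6, 2.4, 10.0]
    print(link_chains(stream, lambda a, b: abs(a - b), 1.0))
```

The usage example shows the characteristic behavior of comparing only with the tail:
gradually changing segments stay linked even when the first and last members differ
widely.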





[0171]The series of processing steps shown in the figure assumes that the total number
of segments n contained in the input video data is known. In many cases, however, the
total number of segments n is not given in advance. In that case, the sequential
clustering algorithm need only decide in Step S53 of the figure whether to continue or
end processing according to whether a further segment is input.

[0172]Next, the processing that detects the periodic chains mentioned above is
explained. A periodic chain C_cyclic can be regarded as a combination of k different
basic similar chains or link similar chains. Hereafter, the segments in periodic chain
C_cyclic are written S_1, ..., S_n, and C(S_i) denotes the number (1, ..., k) of the
chain from which segment S_i originates. Then C_cyclic is a periodic chain if the
sequence of chain numbers C(S_1), C(S_2), ..., C(S_n) can be written in the form
i_1, ..., i_k, i_1, ..., i_k, ..., i_1, ..., i_k. Here, i_1, ..., i_k is a permutation
of the chain numbers 1, ..., k, in other words an arbitrary sequence with no
repetition within one cycle. Below, a periodic chain i_1, ..., i_k that contains
exactly one cycle is called a fundamental-period chain.

[0173]Since cyclic structures in video data are usually not exactly identical and each
cycle is only approximate, the video voice processing unit 10 looks for approximate
periodic chains in the video data by the series of processing steps shown in
drawing 15. Here, the constraint that the fundamental-period chains from which a
periodic chain originates must be uniform can be added if needed. The processing
performed under this constraint is explained here.

[0174]First, as shown in the figure, in Step S71 and Step S72 the video voice
processing unit 10 detects the fundamental-period chains contained in the video data,
generates an initial chain list from them, and then updates the initial chain list so
that every fundamental-period chain it contains fulfills the uniform-chain constraint.

[0175]That is, in Step S71 the video voice processing unit 10 calculates initial chain
list C_list using the algorithm described above that detects basic similar chains or
link similar chains.

[0176]In Step S72, for each chain C contained in the initial chain list, the video
voice processing unit 10 checks its uniformity and, when chain C is not uniform,
divides it into a plurality of uniform subchains whose time intervals are the maximum
within chain C. Then, the video voice processing unit 10 filters the obtained uniform
subchains using the chain quality metrics explained for the algorithm described above
that detects basic similar chains or link similar chains, and adds the selected
uniform subchains to initial chain list C_list.
[0177]Next, in Step S73, the video voice processing unit 10 finds, from chain list
C_list, one pair of chains C_1 and C_2 that overlap in time, that is, chains C_1 and
C_2 such that [C_1^start, C_1^end] ∩ [C_2^start, C_2^end] ≠ ∅.

[0178]Then, in Step S74, the video voice processing unit 10 determines whether such
overlapping chains C_1 and C_2 exist.

[0179]Here, when no overlapping chains C_1 and C_2 exist, the video voice processing
unit 10 ends the series of processing steps, regarding chain list C_list as already
containing the periodic chains.

[0180]On the other hand, when overlapping chains C_1 and C_2 exist, in Steps S75
through S78 the video voice processing unit 10 evaluates the consistency between the
cycles of the periodic chain obtained by combining the two chains, in order to
determine whether the two chains C_1 and C_2 together constitute one periodic chain.
[0181]That is, in Step S75, the video voice processing unit 10 combines the two chains
C_1 and C_2 to form new periodic chain C_M. Here, the segments in chain C_M are
written S_1, S_2, ..., S_|C_M|.

[0182]Then, in Step S76, the video voice processing unit 10 decomposes chain C_M at
every occurrence of C(S_1), the number of the chain from which segment S_1
originates, in the sequence of chain numbers C(S_1), C(S_2), ..., C(S_|C_M|). That
is, chain C_M is decomposed into subchains C_M^1, C_M^2, ..., with each boundary
placed just before a segment that belongs to the same chain as segment S_1. As a
result, the video voice processing unit 10 obtains a list of subchains as shown in the
following formula (14).

[0183]

[Equation 14]

[0184]In formula (14) above, as is clear from this operation, the first segment of
every subchain C_M^i originates from the same chain as segment S_1.

[0185]Then, in Step S77, the video voice processing unit 10 finds subchain C_M^cycle
with the highest frequency of occurrence. That is, the video voice processing unit 10
performs the processing shown in the following formula (15).
[0186]

[Equation 15]

[0187]In Step S78, the video voice processing unit 10 evaluates whether subchain
C_M^cycle with the highest frequency of occurrence can serve as one cycle of the
original chain C_M. Namely, as shown in the following formula (16), the video voice
processing unit 10 defines consistency coefficient mesh as the ratio of the frequency
of occurrence of C_M^cycle to the total number of subchains calculated in Step S76,
and in the following Step S79 determines whether this consistency coefficient exceeds
a predetermined threshold.
[0188]

[Equation 16]



[0189]Here, when the consistency coefficient does not exceed the threshold, the video
voice processing unit 10 returns to Step S73 and repeats the same processing, looking
for another pair of overlapping chains.

[0190]On the other hand, when the consistency coefficient exceeds the threshold, the
video voice processing unit 10 removes chains C_1 and C_2 from chain list C_list in
Step S80, adds chain C_M to chain list C_list in Step S81, and returns to Step S73.

[0191]By repeating such a series of processing steps until no overlapping chains
remain among the periodic chains contained in chain list C_list, the video voice
processing unit 10 can obtain chain list C_list containing the final periodic chains.
[0192]As described above, the video voice processing unit 10 can detect various chains
of similar segments using the dissimilarity metric and the extracted feature
quantities.
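
The cycle-consistency test used when merging two overlapping chains (Steps S76 to S79)
can be illustrated with the sketch below. It takes the sequence of originating chain
numbers of the merged chain in time order, cuts it just before every occurrence of the
first segment's chain number, and reports the most frequent subchain together with the
ratio of its frequency to the total number of subchains. The function names are
hypothetical, and representing a subchain by its tuple of chain numbers is an
assumption made for the example.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def split_into_cycles(chain_ids: Sequence[int]) -> List[Tuple[int, ...]]:
    """S76: cut the sequence just before each occurrence of the first chain number."""
    first = chain_ids[0]
    cycles: List[List[int]] = []
    for cid in chain_ids:
        if cid == first:
            cycles.append([])     # a new subchain starts at this segment
        cycles[-1].append(cid)
    return [tuple(c) for c in cycles]


def consistency(chain_ids: Sequence[int]) -> Tuple[Tuple[int, ...], float]:
    """S77/S78: most frequent subchain and its share of all subchains."""
    cycles = split_into_cycles(chain_ids)
    best, count = Counter(cycles).most_common(1)[0]
    return best, count / len(cycles)


if __name__ == "__main__":
    # Chain numbers of the merged chain in time order; one cycle is irregular.
    ids = [1, 2, 1, 2, 1, 2, 2, 1, 2]
    cycle, coeff = consistency(ids)
    print(cycle, coeff)           # (1, 2) occurs in 3 of 4 subchains
    print(coeff > 0.7)            # S79 with a hypothetical threshold of 0.7
```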

[0193]Next, the chain analysis in Step S5 of drawing 5 is explained. The video voice
processing unit 10 determines and outputs the local video structure and/or the global
video structure of the video data using the detected chains. Here, concrete examples
are given to explain how the results of chain analysis are used to detect the
fundamental structural patterns that occur in video data.
[0194]First, the scene, a local structural pattern that occurs in video data, is
explained.

[0195]As described above, a scene is the most fundamental unit of local video
structure positioned above the segment level, and comprises a series of semantically
related segments. The video voice processing unit 10 can detect such scenes using
chains. In scene detection in the video voice processing unit 10, the condition that a
chain must fulfill is that the time interval between consecutive segments does not
exceed a predetermined value, called the time threshold, for any of the segments that
the chain contains. Here, a chain that fulfills this condition is called a partial
chain.

[0196]The video voice processing unit 10 performs the series of processing steps shown
in drawing 16 in order to detect scenes using chains.

[0197]First, as shown in the figure, the video voice processing unit 10 obtains a
partial chain list in Steps S91 through S94.

[0198]That is, in Step S91 the video voice processing unit 10 obtains a set of initial
chains using the basic similar chain detection algorithm described above.
[0199]Next, in Step S92, for each chain C in the initial chain list thus obtained,
when chain C is not a partial chain, the video voice processing unit 10 decomposes
chain C into a sequence of subchains C = C_1, ..., C_n, each of which is the longest
that satisfies the partial chain condition.

[0200]Then, in Step S93, the video voice processing unit 10 removes chain C from the
chain list.

[0201]In Step S94, the video voice processing unit 10 adds each subchain C_i to the
chain list. After this step is completed, all the chains are local.
[0202]Next, in Step S95, the video voice processing unit 10 finds, from the chain
list, one pair of chains C_1 and C_2 that overlap in time, that is, chains C_1 and
C_2 such that [C_1^start, C_1^end] ∩ [C_2^start, C_2^end] ≠ ∅.

[0203]Then, in Step S96, the video voice processing unit 10 determines whether such
overlapping chains C_1 and C_2 exist.

[0204]Here, when no overlapping chains C_1 and C_2 exist, the video voice processing
unit 10 ends the series of processing steps, regarding one scene as existing for each
chain contained in the chain list.

[0205]On the other hand, when overlapping chains C_1 and C_2 exist, in Step S97 the
video voice processing unit 10 combines the overlapping chains C_1 and C_2 to form new
chain C_M.
[0206]In Step S98, the video voice processing unit 10 removes the overlapping chains
C_1 and C_2 from the chain list, adds chain C_M, and then returns to Step S95 and
repeats the same processing.

[0207]When no overlapping chains remain in the chain list as a result, one scene
exists for each chain contained in the finally obtained chain list. The boundaries of
scene S_i corresponding to chain C_i are given by C_i^start and C_i^end.

[0208]Some segments may remain without being assigned to any chain. By default, the
video voice processing unit 10 groups such remaining segments lying between two
detected scenes into one scene.
[0209]By such a series of processing steps, the video voice processing unit 10 can
detect scenes, which are local structural patterns in video data, by using chains.
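
The scene detection flow of Steps S91 to S98 might look as follows in code. To keep the
sketch short, each segment is represented by a single timestamp and a chain by a list of
such timestamps, which is a simplification introduced here; the time interval between
consecutive segments is then the difference of their timestamps. The initial chains are
assumed to be non-empty and to come from a basic similar chain detector, and all
function names are hypothetical.

```python
from typing import List, Tuple

Chain = List[float]


def split_local(chain: Chain, t_max: float) -> List[Chain]:
    """S92: cut a time-sorted chain wherever the gap exceeds the time threshold."""
    parts: List[Chain] = [[chain[0]]]
    for prev, cur in zip(chain, chain[1:]):
        if cur - prev > t_max:
            parts.append([cur])       # gap too long: start a new subchain
        else:
            parts[-1].append(cur)
    return parts


def overlaps(a: Chain, b: Chain) -> bool:
    """True when the intervals [start, end] of two sorted chains intersect."""
    return a[0] <= b[-1] and b[0] <= a[-1]


def detect_scenes(chains: List[Chain], t_max: float) -> List[Tuple[float, float]]:
    """Return (start, end) boundaries, one pair per detected scene."""
    # S91-S94: replace every chain by its local subchains.
    c_list = [p for c in chains for p in split_local(sorted(c), t_max)]
    merged = True
    while merged:                     # S95-S98: merge until nothing overlaps
        merged = False
        for i in range(len(c_list)):
            for j in range(i + 1, len(c_list)):
                if overlaps(c_list[i], c_list[j]):
                    new = sorted(c_list[i] + c_list[j])            # S97
                    rest = [c for k, c in enumerate(c_list) if k not in (i, j)]
                    c_list = rest + [new]                          # S98
                    merged = True
                    break
            if merged:
                break
    return sorted((c[0], c[-1]) for c in c_list)


if __name__ == "__main__":
    speaker_a = [0, 2, 4, 20, 22]     # hypothetical segment times
    speaker_b = [1, 3, 5, 21]
    print(detect_scenes([speaker_a, speaker_b], t_max=5))
```

In the usage example the two speakers' chains interleave, so after splitting at the long
gap they merge into two scenes, which mirrors the conversation example discussed next.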



[0210]Consider the case where such processing is applied to the conversation scene
shown earlier in drawing 2. In this case, in Steps S91 through S94 the video voice
processing unit 10 obtains a partial chain for the segments of each speaker. Then, in
Step S97, the video voice processing unit 10 combines these chains to form a single
large chain representing the whole scene.

[0211]Thus, the video voice processing unit 10 can detect the scene in the case of a
conversation scene.

[0212]Note that, when the video voice processing unit 10 detects a scene, not every
segment in the scene is necessarily contained in a chain.

[0213]The video voice processing unit 10 can also detect scenes sequentially by
executing the algorithm described above sequentially.

[0214]Next, the case where news items are detected is explained as an example of
global structural patterns.

[0215]As described above, a news program has a cyclic structure in which each news
item starts, for example, with an introduction by the anchor, followed by one or more
reports from the field. That is, such a video structure can be regarded as a simple
cyclic structure in which one cycle runs from an anchor shot to just before the next
anchor shot.

[0216]The video voice processing unit 10 performs the series of processing steps
outlined in drawing 17 in order to detect news items automatically using chains.

[0217]First, as shown in the figure, in Step S101 the video voice processing unit 10
detects periodic chains using the periodic chain detection algorithm described above.
By performing this step, the video voice processing unit 10 obtains a list of periodic
chains. Here, each cycle may or may not represent a news item.

[0218]Next, in Step S102, the video voice processing unit 10 removes all periodic
chains whose cycle is shorter than a specified proportion of the overall length of the
video data. By performing this step, the video voice processing unit 10 can eliminate
periodic chains with short cycles that are unlikely to represent news items. Such
cycles may arise, for example, when a host interviews a guest, or when other short
cycles appear in the news broadcast.
[0219]Then, in Step S103, from all the periodic chains remaining after Step S102, the
video voice processing unit 10 finds the periodic chain that is shortest in time and,
when this periodic chain overlaps other periodic chains, removes it from the list of
periodic chains. The video voice processing unit 10 repeats this processing until no
periodic chain overlaps any other periodic chain. The list of periodic chains
remaining after Step S103 contains the detected news items. That is, each cycle of the
periodic chains obtained in Step S103 represents one news item.



[0220]Thus, the video voice processing unit 10 can detect news items automatically
using chains.

[0221]It is worth noting that the video voice processing unit 10 also works
satisfactorily when, for example, the newscaster changes partway through the news
broadcast, such as between the main news, sports, and business segments.

[0222]Next, the case where plays in a sportscast are detected is explained.
[0223]Many sports are characterized by a fixed pattern in which a play consists of the
same series of actions repeated over and over. For example, in baseball, a play
consists of the pitcher throwing the ball and the batter trying to hit it. Football
and rugby are other examples of team sports that have such a play structure.

[0224]When a sport with this play structure is broadcast, the video data shows a
repetition of a group of segments corresponding to the parts of a play. That is, in
the video data a segment showing the pitcher is followed by a segment showing the
batter, and when the ball is hit, segments showing the outfielders and so on follow.
Therefore, when chain detection by the video voice processing unit 10 is applied to a
baseball broadcast, the segments showing the pitcher are detected as one chain, the
segments showing the batter form another chain, and other chains correspond to the
outfield and various other scenes.

[0225]That is, in these sportscasts, the play structure forms a periodic pattern that
can be detected using the periodic chain detecting method described above. Tennis is
another such example. In tennis, the video data forms a cycle such as serve, volley,
serve, volley. In this case, since the segments showing each serve are visually
similar to one another, the video voice processing unit 10 can use these segments to
detect plays. As a result, structural analysis by the video voice processing unit 10
can approximately detect the play structure of a game.
[0226]In other sports, especially individual events, the play structure consists of
one contestant completing a certain activity, and every contestant can be regarded as
performing approximately the same activity. For example, in a ski-jumping
competition, each contestant performs one jump, and then the next contestant performs
the same jump. That is, the video data of a ski-jumping broadcast typically consists
of a sequence of segments in which a contestant prepares for the jump, slides down
the in-run, jumps, and lands. The video data therefore consists of such a series of
segments repeated for every contestant. When chain detection is applied to the video
data of such a broadcast, a series of chains is detected, each containing the similar
segments for one stage of the jump. Therefore, the cycle for each contestant can be
extracted using the periodic chain detecting method.
[0227]When the video voice processing unit 10 automatically detects plays in a
sportscast by chain analysis, further constraints may be needed to eliminate chains
that are not suitable. Which constraints are appropriate depends on the kind of sport,
but the video voice processing unit 10 can use, for example, the empirical rule of
detecting as plays only those detected periodic chains whose cycle is sufficiently
long.

[0228]That is, the video voice processing unit 10 performs the series of processing
steps outlined in drawing 18 in order to detect plays in a sportscast automatically
using chains.

[0229]First, as shown in the figure, in Step S111 the video voice processing unit 10
detects periodic chains using the periodic chain detection algorithm described above.
[0230]Then, in Step S112, the video voice processing unit 10 applies constraints to
the list of obtained chains, filters the chain list, and removes chains that are not
essential. One example of such a constraint is to keep only the periodic chains that
extend over most of the program. Of course, the video voice processing unit 10 may add
constraints specific to the sport in question.
[0231]Thus, the video voice processing unit 10 can detect plays in a sportscast
automatically by chain analysis.

[0232]Below, a case where topics are detected by combining periodic detection and scene detection is explained. 

[0233]For example, the video data of many TV programs, such as dramas, comedies, and variety shows, consists of the scenes mentioned above; however, the video data may also have a higher-level structure, a topic, which consists of a sequence of several related scenes. Such a topic does not necessarily resemble a topic in a newscast, which always starts with an introduction segment by a studio anchor. For example, as a visual cue, a segment showing a logo image or the host may be used instead of an introduction segment, or, as an auditory cue, the same theme music may always be played whenever a new topic starts. 
[0234]Whether the video data of a given program has such a topic structure can be judged by combining periodic detection and scene detection. 

[0235]Therefore, the video voice processing unit 10 performs the series of processes outlined in drawing 19 in order to perform topic detection that combines chain-based periodic detection with scene detection. 
[0236]First, as shown in the figure, in Step S121, the video voice processing unit 10 performs basic similar chain detection and identifies a list of basic similar chains. 

[0237]Next, in Step S122, the video voice processing unit 10 performs periodic chain detection and identifies a list of periodic chains. 

[0238]Then, in Step S123, the video voice processing unit 10 applies the algorithm previously shown in drawing 16, using the list of basic similar chains obtained in Step S121, and extracts the scene structure. As a result, the video voice processing unit 10 obtains a list of scenes. 



[0239]Then, in Step S124, the video voice processing unit 10 compares the list of periodic chains obtained in Step S122 with each scene element detected in Step S123. Here, the video voice processing unit 10 removes all periodic chains whose cycle is shorter than a scene contained in the list of detected scenes. Each cycle of the remaining periodic chains contains several scenes, and each such cycle is identified as a candidate topic. 

[0240]Thus, the video voice processing unit 10 can perform topic detection by combining chain-based periodic detection with scene detection. 
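
The comparison in Step S124 can be sketched as follows. This is only one possible reading of the step, under stated assumptions: each periodic chain is given as a sorted list of the start times of its recurring segments, each scene as a (start, end) pair, and a chain is dropped when its mean cycle is shorter than the longest detected scene. The function name and the data layout are hypothetical.

```python
from typing import List, Tuple

Scene = Tuple[float, float]  # (start, end) of a detected scene, in seconds


def candidate_topics(periodic_chains: List[List[float]],
                     scenes: List[Scene]) -> List[Scene]:
    """Return the cycles of the surviving periodic chains as candidate topics.

    Assumption: a chain is removed when its mean cycle is shorter than the
    longest scene in the list of detected scenes.
    """
    longest_scene = max(end - start for start, end in scenes)
    topics: List[Scene] = []
    for times in periodic_chains:
        cycles = list(zip(times, times[1:]))  # consecutive occurrences bound one cycle
        if not cycles:
            continue
        mean_cycle = sum(b - a for a, b in cycles) / len(cycles)
        if mean_cycle < longest_scene:
            continue  # cycle too short to contain scenes
        topics.extend(cycles)
    return topics


scenes = [(0.0, 40.0), (40.0, 100.0), (100.0, 130.0), (130.0, 200.0), (200.0, 240.0)]
periodic_chains = [
    [0.0, 100.0, 200.0],       # cycle of 100 s, longer than any scene
    [10.0, 25.0, 40.0, 55.0],  # cycle of 15 s, shorter than the scenes
]
topics = candidate_topics(periodic_chains, scenes)
print(topics)
```

In this toy input only the first chain survives, and each of its two cycles spans two scenes.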

[0241]The video voice processing unit 10 can also improve the accuracy of topic detection by imposing other restrictions and filtering conditions in Step S124. 

[0242]As mentioned above, the video voice processing unit 10 can determine and output various local video structures and/or various global video structures of video data using the various detected chains. 

[0243]As explained above, the video voice processing unit 10 shown as an embodiment of the invention can detect similar chains, each comprising two or more mutually similar video segments or sound segments. By analyzing these similar chains, the video voice processing unit 10 can extract high-level video structure. In particular, the video voice processing unit 10 can analyze local video structure and global video structure within a common framework. 

[0244]The processing of the video voice processing unit 10 can be performed fully automatically, and the user does not need to know the structure of the contents of the video data in advance. 
[0245]Video structure can also be analyzed incrementally by using sequential chain detection, and if the platform has sufficient computing power, the video voice processing unit 10 can perform video structure analysis in real time. The video voice processing unit 10 can therefore be used for live video broadcasts as well as for previously recorded video data. For example, the video voice processing unit 10 is applicable to play detection in a live sportscast. 

[0246]As a result of detecting video structure, the video voice processing unit 10 can provide a basis for new high-level access for video browsing. That is, the video voice processing unit 10 enables content-based access to video data by visualizing the contents of the video data using high-level video structure such as topics rather than segments. For example, when the video voice processing unit 10 displays scenes, the user can quickly grasp the gist of a program and promptly find a portion of interest. 

[0247]Furthermore, by using the result of topic detection in a newscast, the video voice processing unit 10 gives the user powerful new ways of accessing the newscast, such as selecting and viewing individual news items. 

[0248]As a result of video structure detection, the video voice processing unit 10 can provide a basis for automatically creating a summary of video data. In general, creating a coherent summary requires not combining arbitrary segments contained in the video data, but decomposing the video data into meaningful components from which it can be reconstructed, and combining suitable segments on that basis. The video structure detected by the video voice processing unit 10 provides fundamental information for creating such a summary. 

[0249]The video voice processing unit 10 can also analyze video data according to genre. For example, the video voice processing unit 10 makes it possible to detect only tennis games. 

[0250]Furthermore, when incorporated into a video editing system in a broadcasting station, the video voice processing unit 10 makes it possible to edit video data based on its contents. 

[0251]Furthermore, the video voice processing unit 10 can be used in ordinary homes to analyze home video and to automatically extract video structure from it. The video voice processing unit 10 can also be used to summarize the contents of video data and to edit video data based on its contents. 

[0252]On the other hand, the video voice processing unit 10 can also be used as a supplementary tool that assists analysis of the contents of video data with the help of video chains. In particular, by visualizing the result of chain detection, the video voice processing unit 10 can facilitate navigation of the contents of video data and video structure analysis. 

[0253]Since its algorithm is very simple and computationally efficient, the video voice processing unit 10 is also applicable to household electronic equipment such as set-top boxes, digital video recorders, and home servers. 
[0254]This invention is not limited to the embodiment mentioned above. For example, the characteristic quantities used for measuring similarity between segments and the contents of the video data to which the invention is applied may of course be other than those mentioned above, and it goes without saying that other changes can be made as appropriate within a range that does not deviate from the gist of this invention. 
[0255] 

[Effect of the Invention]As explained in detail above, the signal processing method according to this invention is a signal processing method that detects and analyzes patterns reflecting the semantic structure of the contents of a supplied signal. It comprises: a characteristic quantity extraction process of extracting at least one characteristic quantity representing the features of each segment formed from a series of consecutive frames constituting the signal; a similarity measuring process of computing, for each characteristic quantity, a metric for measuring the similarity between a pair of segments using that characteristic quantity, and measuring the similarity between the pair of segments by this metric; and a detection process of detecting, using the characteristic quantities and the metric, a similar chain comprising two or more mutually similar segments among the segments. 

[0256]Therefore, the signal processing method according to this invention can detect the basic structural patterns formed by similar segments in a signal, and can extract high-level structure by analyzing how these structural patterns are combined. 
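
The three processes of the method (characteristic quantity extraction, similarity measurement, and similar chain detection) can be illustrated with a toy sketch. It is not the clustering procedure of the embodiment: it assumes that each segment has already been reduced to a small feature vector, uses Euclidean distance as the dissimilarity metric, and groups segments greedily against a fixed threshold. The function names, the threshold, and the feature values are hypothetical.

```python
import math
from typing import List, Sequence


def dissimilarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors (an assumed metric)."""
    return math.dist(a, b)


def detect_similar_chains(features: List[Sequence[float]],
                          threshold: float) -> List[List[int]]:
    """Greedily group segment indices whose features lie within `threshold`
    of a group's first member; keep groups with two or more segments."""
    chains: List[List[int]] = []
    for index, feature in enumerate(features):
        for chain in chains:
            if dissimilarity(features[chain[0]], feature) <= threshold:
                chain.append(index)
                break
        else:
            chains.append([index])  # no similar group found, start a new one
    return [chain for chain in chains if len(chain) >= 2]


# Toy feature vectors for six consecutive segments in an A B A B C A pattern.
features = [(0.1, 0.9), (0.8, 0.2), (0.12, 0.88),
            (0.79, 0.22), (0.5, 0.5), (0.11, 0.91)]
chains = detect_similar_chains(features, threshold=0.1)
print(chains)
```

Each returned list holds the indices of mutually similar segments in temporal order; the isolated segment at index 4 forms no chain.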

[0257]The video voice processing unit according to this invention is a video voice processing unit that detects and analyzes patterns of video and/or sound reflecting the semantic structure of the contents of a supplied video signal. It comprises: a characteristic quantity extracting means for extracting at least one characteristic quantity representing the features of each video and/or sound segment formed from a series of consecutive video and/or sound frames constituting the video signal; a similarity measuring means for computing, for each characteristic quantity, a metric for measuring the similarity between a pair of video and/or sound segments using that characteristic quantity, and measuring the similarity between the pair of video and/or sound segments by this metric; and a detection means for detecting, using the characteristic quantities and the metric, a similar chain comprising two or more mutually similar video and/or sound segments among the video and/or sound segments. 

[0258]Therefore, the video voice processing unit according to this invention can determine and output the basic structural patterns formed by similar video and/or sound segments in a video signal, and can extract high-level video structure by analyzing how these structural patterns are combined. 



DESCRIPTION OF DRAWINGS 



[Brief Description of the Drawings] 

[Drawing 1] It is a figure explaining the structure of the video data to which this invention is applied, showing the structure of the modeled video data. 
[Drawing 2] It is a figure explaining a similar chain used to extract local video structure. 
[Drawing 3] It is a figure explaining a similar chain used to extract global video structure. 
[Drawing 4] It is a block diagram explaining the configuration of the video voice processing unit shown as an embodiment of the invention. 
[Drawing 5] It is a flow chart explaining the series of processes performed in the video voice processing unit when detecting and analyzing video structure. 
[Drawing 6] It is a figure explaining the dynamic characteristic quantity sampling process in the video voice processing unit. 
[Drawing 7] It is a figure explaining a basic similar chain. 
[Drawing 8] It is a figure explaining a link similar chain. 
[Drawing 9] It is a figure explaining a periodic chain. 

[Drawing 10] It is a flow chart explaining the series of processes performed in the video voice processing unit when detecting basic similar chains using a batch clustering technique. 
[Drawing 11] It is a figure explaining the dissimilarity threshold. 
[Drawing 12] It is a flow chart explaining the series of processes performed in the video voice processing unit when performing chain filtering of basic similar chains. 
[Drawing 13] It is a flow chart explaining the series of processes performed in the video voice processing unit when detecting basic similar chains using a sequential clustering technique. 
[Drawing 14] It is a flow chart explaining the series of processes performed in the video voice processing unit when detecting link similar chains. 
[Drawing 15] It is a flow chart explaining the series of processes performed in the video voice processing unit when detecting periodic chains. 
[Drawing 16] It is a flow chart explaining the series of processes performed in the video voice processing unit when detecting scenes using chains. 
[Drawing 17] It is a flow chart explaining the series of processes performed in the video voice processing unit when detecting news items using chains. 
[Drawing 18] It is a flow chart explaining the series of processes performed in the video voice processing unit when detecting plays in a sportscast using chains. 
[Drawing 19] It is a flow chart explaining the series of processes performed in the video voice processing unit when performing topic detection that combines periodic detection and scene detection using chains. 
[Description of Notations] 

10 Video voice processing unit 
11 Video dividing part 
12 Video segment memory 
13 Image feature quantity extracting part 
14 Voice feature amount extraction part 
15 Segment characteristic quantity memory 
16 Chain primary detecting element 
17 Characteristic quantity similarity test section 
18 Chain analyzing part 



